Mastering In-Memory Data Costs

Introduction

The rise of cloud computing has transformed the landscape for developers, unlocking a new era of productivity and innovation. With robust infrastructure easily accessible, developers can create high-quality products in record time. Scalability, once a daunting task, has become streamlined and efficient.

However, alongside these benefits comes a new challenge: escalating costs. In an era where every dollar counts, managing expenses has become paramount. Controlling cloud costs has become such an acute and difficult-to-solve problem that it has given rise to a whole new function: FinOps. In this post, we will analyze the crucial aspects of cost control in infrastructure management, focusing specifically on in-memory data stores like Redis, Valkey, KeyDB, and Dragonfly.

In the course of our journey at Dragonfly, we've spoken with numerous teams who share a common desire to reduce their in-memory data expenses, among other objectives. These conversations have exposed me to a wide range of use cases and optimization strategies. I came to realize that estimating the total cost of ownership of running and maintaining infrastructure is complicated, but I have a few takeaways that point in the right direction. My goal in this post is to consolidate some of the most useful cost-related best practices, giving architects and FinOps teams the tools they need to optimize their in-memory data costs effectively.

When considering the cost structure of in-memory data stores, it's essential to account for both the direct costs of the service, such as the instances on which the service runs, and the indirect costs, which include monitoring, maintenance, and other DevOps operations. The key is to understand the memory, CPU, and networking requirements of your application and how they interrelate. Optimizing these aspects can significantly impact the overall cost-efficiency of your in-memory data store solutions.

Lay of the Land

Redis is an in-memory data store that's great for applications that need super-fast responses and handle a lot of traffic. Redis was originally created by Salvatore Sanfilippo more than 15 years ago. It is optimized for high performance on small instances (with a single CPU) and can typically handle workloads of a few GB of memory and up to ~200K QPS of simple operations. For bigger workloads, Redis can be deployed in cluster mode, where data is sharded between instances. However, each instance is still limited to utilizing mostly a single CPU.

This is one of the reasons why Dragonfly was created. Dragonfly is a modern multi-threaded in-memory data store that supports both Redis and Memcached APIs. Created in 2022, Dragonfly is optimized to utilize modern hardware architectures to solve data growth challenges. Dragonfly scales vertically first to support millions of QPS on hundreds of GB of data from a single multi-core server. For anything bigger than that, Dragonfly also supports a cluster configuration. Most in-memory data store services and products are built on top of either Redis or Dragonfly.

Before we get into the nitty-gritty, let's clear up one thing: memory ain't cheap. So before you jump on the in-memory bandwagon, ask yourself these two questions:

Do you get more than a few thousand requests per second now or expect to in the future?
Does your application need responses in less than a millisecond?

If you answered yes to either of those questions, then you probably need an in-memory data store like Redis, Valkey, or Dragonfly. If not, a disk-based database may suffice for your needs.

Self-Hosted vs. Managed Services

Choosing the optimal solution for your team can significantly influence your in-memory data costs. Some organizations have the necessary culture and expertise to self-host and manage their infrastructure, while others prefer to use managed services whenever possible. If your organization has the capability to self-host and maintain your in-memory data store deployment, this can often be the most economical choice.

Self-Hosted

For organizations inclined towards self-hosting, options like Redis, Valkey, KeyDB, and Dragonfly should be considered. Self-hosting often proves to be more cost-effective compared to managed solutions, which typically charge a premium over hardware costs. The direct cost of self-hosting can be around half the price of a managed service. Nonetheless, hidden costs such as maintenance, patching, security configuration, and backups can arise, influenced by factors like application complexity, throughput, and memory requirements. The more complex and memory-intensive the application, the higher these costs may be.

If your in-memory workloads can be served from a single instance, that is ideal. For example, Dragonfly can scale to hundreds of GB of memory and millions of QPS from a single instance, which is usually sufficient for the vast majority of projects. However, if your application demands more, you will need to set up a cluster. In this scenario, it is advisable to use a managed solution, as provisioning and managing a cluster topology is highly complex. Managed service providers have years of experience optimizing their services for cluster deployments, leading to a lower total cost of ownership (TCO) compared to self-hosting a cluster.

Managed Services

In-memory managed services offer significant value in terms of availability, backup, monitoring, and overall management. Offloading part of the responsibility of your in-memory data store deployment to a managed service naturally incurs direct costs, but it can also reduce management costs in some cases. For some organizations, offloading to a managed service aligns with their operational strategy. Others should evaluate their management costs and the potential cost of lost business due to downtime against the additional costs of a managed service. It's important to note that having a managed service does not absolve your DevOps team of all responsibilities. Instead, think of it as an extra layer of experts to support your DevOps team.
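The trade-off described above can be put into a rough back-of-the-envelope model. The sketch below is only an illustration: the `Option` fields, the engineer rate, the downtime cost, and every number in the two sample options are hypothetical placeholders, chosen so that the self-hosted direct cost is about half the managed price, as mentioned earlier. Plug in your own figures.

```python
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    direct_monthly_usd: float       # instance or service bill
    ops_hours_per_month: float      # patching, backups, monitoring, on-call
    downtime_hours_per_month: float # expected downtime, a risk estimate

def monthly_tco(opt: Option, engineer_usd_per_hour: float,
                downtime_usd_per_hour: float) -> float:
    """Direct cost + management cost + expected cost of lost business."""
    management = opt.ops_hours_per_month * engineer_usd_per_hour
    failure = opt.downtime_hours_per_month * downtime_usd_per_hour
    return opt.direct_monthly_usd + management + failure

# Hypothetical inputs: self-hosting is cheaper on paper but needs more upkeep.
self_hosted = Option("self-hosted", 1000.0, 20.0, 0.5)
managed = Option("managed", 2000.0, 4.0, 0.25)

for opt in (self_hosted, managed):
    total = monthly_tco(opt, engineer_usd_per_hour=100.0,
                        downtime_usd_per_hour=2000.0)
    print(f"{opt.name}: ${total:,.0f} per month")
```

With these made-up inputs the managed option comes out ahead despite the higher bill; with a smaller ops burden or cheaper downtime, the result flips. The point is to compare all three terms, not just the direct cost.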

Once you decide on using a managed service, the next step is to choose the right one. Your options include:

Cloud Provider Managed Services: Every major cloud provider offers at least one flavor of an in-memory data service. In its most basic form, a cloud-managed service is a control plane built around an open-source Redis, or nowadays, the Valkey fork. The control plane is a management layer that orchestrates the deployment and management of the instances. While you are still managing instances (similar to self-hosting), the cloud control plane helps you deploy faster and avoid misconfigurations. This convenience comes at a premium, often costing over 60% more than the instance cost alone.
Some cloud providers, like AWS, offer multiple in-memory data products (e.g., ElastiCache, MemoryDB, and ElastiCache Serverless). As with other cloud services, the more automation and features you receive, the higher the cost. For example, a durable or serverless Redis solution can cost 5 to 10 times more than the instance cost.
Multi-Cloud Independent Software Vendors (ISVs): Independent software vendors like Redis Cloud, Aiven for Dragonfly, or Dragonfly Cloud provide enhanced features over cloud provider managed services, with their main value residing in management simplicity. These vendors have teams of experts who enhance both the control plane and the in-memory engine, resulting in a more robust solution that requires less maintenance from your team. Pricing for these services can vary significantly: some vendors, like DragonflyDB, may offer services at up to 80% less than cloud providers, while others may charge more.

Sizing

Modern cloud hardware comes with a few CPU-to-memory ratios. The common ones are 1:2, 1:4, 1:8, and 1:16 (one CPU to 16GB of memory). The best teams I've encountered know both their memory and throughput requirements. For example, teams that are heavy on memory will need different instances than those that rely heavily on Lua scripting, which is CPU-intensive. Achieving the optimal ratio is a trial-and-error process that can take a few weeks. You start with a high memory ratio (1:16) and adjust downward if your CPU usage peaks above 80%. Setting this ratio correctly can save on hardware costs and provide peace of mind for better sleep at night, which is priceless.

However, the challenge is that some clouds and services do not offer all types of machines, so in some cases, you will have to self-manage your workload, at least for the first few weeks.
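The trial-and-error loop for finding the ratio can be sketched in a few lines. This is a simplified illustration, assuming you record the peak CPU utilization for each trial period; the function name `next_ratio` and the sample peaks are hypothetical, while the ratios and the 80% threshold come from the text above.

```python
# CPU-to-memory ratios from the text, as GB of memory per vCPU,
# ordered from memory-heavy (1:16) to CPU-heavy (1:2).
RATIOS_GB_PER_VCPU = [16, 8, 4, 2]

def next_ratio(current_gb_per_vcpu: int, peak_cpu: float,
               threshold: float = 0.80) -> int:
    """Return the ratio to try next, given peak CPU utilization (0.0 to 1.0)."""
    idx = RATIOS_GB_PER_VCPU.index(current_gb_per_vcpu)
    is_last = idx == len(RATIOS_GB_PER_VCPU) - 1
    if peak_cpu > threshold and not is_last:
        # CPU-bound: move to an instance family with more vCPUs per GB.
        return RATIOS_GB_PER_VCPU[idx + 1]
    return current_gb_per_vcpu

ratio = 16  # start memory-heavy
for observed_peak in [0.93, 0.86, 0.61]:  # hypothetical peaks, one per trial
    ratio = next_ratio(ratio, observed_peak)

print(f"settled on 1:{ratio}")
```

In practice each iteration means migrating to a different instance family and observing real traffic for a while, which is why the process takes weeks rather than minutes.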

Provisioned vs. Serverless

Provisioned

With in-memory data stores, your application must always have enough memory resources. Provisioning memory and CPU resources for your in-memory data store helps protect the application from traffic bursts. In addition to determining your CPU-to-memory ratio, you should also assess the maximum load your system needs to sustain: what is the highest load you will see in your busiest second ever? The answer depends largely on how your company manages risk. I have seen teams that provision 10x the memory they need, and others that over-provision by only 10%. The rule of thumb is that smaller instances require a higher degree of over-provisioning, due to factors discussed in The Unbearable Lightness of Horizontal Scaling blog post. Your organization's decision here can significantly impact both direct and indirect costs.

Serverless

The idea of having a single endpoint where you can store an infinite amount of data with unlimited throughput and sub-millisecond latency sounds like a dream. This is the promise of serverless offerings. However, for in-memory data stores, achieving this is extremely challenging and correspondingly expensive. Detaching compute from storage is the dominant architecture for serverless disk-based databases. This allows for the dynamic creation of compute instances to handle query bursts. Below is a simplified diagram of a typical distributed on-disk database.

However, for sub-millisecond in-memory data stores, storing data on disk is not a viable option for many use cases, as it causes latency spikes. To date, we have not seen any true serverless in-memory data stores, because memory is not an easily shareable resource.

Last year, AWS announced ElastiCache Serverless. It is neither unlimited nor infinite, and its direct costs are high, but its indirect costs are very low. As we discussed in a previous blog post, instead of configuring, running, monitoring, and maintaining t3.micro instances, I recommend that teams buy peace of mind and use ElastiCache Serverless for any application that requires less than 2GB of memory. The 2GB threshold is somewhat arbitrary, but at around that size the serverless bill passes $200 per month, and because serverless memory costs approximately 10 times as much as provisioned memory, fixed-sized instances become the more cost-effective option.
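To see roughly where a threshold like this comes from, here is a small sketch of the memory-only part of a serverless bill. It uses the $0.125 per GB-hour rate quoted in the footnote of this post and assumes 730 hours in a month; ECPU charges come on top and are not modeled here, so the real bill will be higher.

```python
SERVERLESS_USD_PER_GB_HOUR = 0.125  # memory rate from the footnote (May 2024)
HOURS_PER_MONTH = 730               # assumption: average month length in hours

def serverless_memory_monthly_usd(stored_gb: float) -> float:
    """Memory-only monthly cost; request (ECPU) charges are excluded."""
    return stored_gb * SERVERLESS_USD_PER_GB_HOUR * HOURS_PER_MONTH

for gb in (0.5, 1, 2, 4):
    print(f"{gb} GB -> ${serverless_memory_monthly_usd(gb):.2f} per month")
```

The cost grows linearly with stored data, so every additional GB widens the gap to a fixed-size instance.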

Pricing Model

Provisioned Instance & Memory Costs

When analyzing the pricing models of hyper-scale cloud providers, one conclusion becomes clear: the insurance premium for full and automatic elasticity is extremely high. Consider the following ElastiCache scenarios as an example.¹ A reserved node with an all-upfront payment, which essentially offers the least automatic elasticity, results in the lowest hourly rate. On the other hand, ElastiCache Serverless, which provides the highest automatic elasticity, is significantly more costly for the same memory workload. If you are confident that your workload will fit within fixed-sized instances, with little spiky traffic, a good eviction strategy, and steady user growth over a predictable time span, then reserved instances are clearly the most cost-effective choice.

Instance	vCPUs	Memory (GiB)	Reserved (USD/hour)	On-Demand (USD/hour)	Serverless (USD/hour, memory only)
`m6g.xlarge`	4	12.93	0.189	0.297	1.735
`m6g.2xlarge`	8	26.04	0.378	0.593	3.495
`m6g.4xlarge`	16	52.26	0.755	1.186	7.014
`m6g.8xlarge`	32	103.68	1.511	2.372	13.916
`m6g.12xlarge`	48	157.12	2.266	3.557	21.088
`m6g.16xlarge`	64	209.55	3.021	4.743	28.125

It's hard to get sizing right from the get-go. You need to remain provisioned and flexible until you can figure out the memory and CPU unit economics of your application. Serverless or on-demand pricing models offer better flexibility, allowing you to adapt to new use cases and growth in your applications. Once your workload stabilizes, you've optimized your sizing, and your application traffic is stable, switching to an annual reserved data store becomes more cost-effective. With a reserved commitment, you can save 10%–40% on your direct costs. However, it's important to note that management costs remain unaffected by this change.
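The numbers in the table can be reproduced and compared with a short script. This sketch follows the footnote's method for the serverless column (convert GiB to GB, then multiply by $0.125 per GB-hour) and derives the reserved discount and the serverless premium from two of the table rows; the helper names are my own.

```python
GIB_TO_GB = 2**30 / 10**9            # 1 GiB = 1.073741824 GB
SERVERLESS_USD_PER_GB_HOUR = 0.125   # memory only, ECPU costs excluded

# (instance, memory GiB, reserved USD/hour, on-demand USD/hour) from the table
ROWS = [
    ("m6g.xlarge", 12.93, 0.189, 0.297),
    ("m6g.16xlarge", 209.55, 3.021, 4.743),
]

def serverless_hourly_usd(memory_gib: float) -> float:
    """Hourly serverless cost of holding the same amount of memory."""
    return memory_gib * GIB_TO_GB * SERVERLESS_USD_PER_GB_HOUR

def reserved_saving(on_demand: float, reserved: float) -> float:
    """Fraction saved by committing to a reserved node."""
    return 1 - reserved / on_demand

for name, gib, reserved, on_demand in ROWS:
    serverless = serverless_hourly_usd(gib)
    print(f"{name}: serverless ${serverless:.3f}/h, "
          f"reserved saves {reserved_saving(on_demand, reserved):.0%}, "
          f"serverless is {serverless / reserved:.1f}x reserved")
```

For these rows the reserved discount lands near the top of the 10%–40% range, and the serverless memory price is roughly nine times the reserved price.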

Data Transfer Costs

In-memory data stores are extremely fast creatures. In 99% of cases, the latency you observe comes from the network between the data store and your application. Therefore, it's crucial to position the data store as close to your application as possible. A data store within the same Availability Zone (AZ) as your application will offer the best performance and the lowest cost, as you will experience minimal latencies and likely need fewer application servers. With VPC peering, there are no costs for data transfer if you are within the same zone.

However, if your setup involves cross-zone or cross-region configurations, you will incur costs for the egress traffic between zones or regions. AWS, for instance, charges different rates depending on the type of transfer. While data transfer costs may not seem significant at first, they can add up quickly, depending on your traffic. It's important to consider these costs when planning your in-memory data setup, and more details can be found here.
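A quick estimate shows how fast this adds up. The sketch below is illustrative only: the request rate and payload size are hypothetical, the month is assumed to have 730 hours, and the per-GB rate is a placeholder parameter (the $0.02 used here stands for a hypothetical $0.01 per GB charged in each direction). Check your provider's current pricing before relying on any figure.

```python
SECONDS_PER_MONTH = 730 * 3600  # assumption: 730 hours in a month

def monthly_transfer_gb(requests_per_second: float,
                        bytes_per_request: float) -> float:
    """Total traffic per month in GB (request plus response bytes)."""
    return requests_per_second * bytes_per_request * SECONDS_PER_MONTH / 1e9

def monthly_transfer_usd(requests_per_second: float, bytes_per_request: float,
                         usd_per_gb: float) -> float:
    return monthly_transfer_gb(requests_per_second, bytes_per_request) * usd_per_gb

# Hypothetical workload: 50K requests/s, 1,000 bytes per round trip,
# all of it crossing a zone boundary.
traffic_gb = monthly_transfer_gb(50_000, 1_000)
cost = monthly_transfer_usd(50_000, 1_000, usd_per_gb=0.02)
print(f"{traffic_gb:,.0f} GB per month -> ${cost:,.0f} per month")
```

Under these assumptions, a modest per-GB rate turns into a four-figure monthly line item, which can rival the instance cost itself for smaller deployments.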

Backup Costs

While disk and cloud storage are much more affordable than memory, backup costs can still be significant for in-memory data stores. Some in-memory engines, like Redis OSS, can double the memory needed during backup, meaning you may need to provision more than twice the memory required for your data store. In contrast, Dragonfly uses a different snapshot mechanism, as discussed in detail in this blog post, that requires minimal extra memory to complete.
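The effect of snapshot overhead on instance choice can be illustrated with the on-demand prices from the table earlier in this post. In this sketch, an overhead of 1.0 models the worst case where memory doubles during a backup, and 0.05 is a hypothetical stand-in for "minimal extra memory", not a measured figure. It also ignores any other headroom you would normally keep.

```python
# (instance, memory GiB, on-demand USD/hour) from the pricing table in this post
NODES = [
    ("m6g.xlarge", 12.93, 0.297),
    ("m6g.2xlarge", 26.04, 0.593),
    ("m6g.4xlarge", 52.26, 1.186),
    ("m6g.8xlarge", 103.68, 2.372),
    ("m6g.12xlarge", 157.12, 3.557),
    ("m6g.16xlarge", 209.55, 4.743),
]

def memory_to_provision_gib(dataset_gib: float, snapshot_overhead: float) -> float:
    """Dataset size plus the extra memory needed while a snapshot runs."""
    return dataset_gib * (1.0 + snapshot_overhead)

def smallest_fitting_node(required_gib: float):
    """Return (name, hourly price) of the smallest node that fits, or None."""
    for name, memory_gib, price in NODES:  # NODES is sorted by memory
        if memory_gib >= required_gib:
            return name, price
    return None

dataset_gib = 40.0  # hypothetical dataset size
for label, overhead in (("doubling during backup", 1.0),
                        ("minimal overhead (hypothetical)", 0.05)):
    node = smallest_fitting_node(memory_to_provision_gib(dataset_gib, overhead))
    print(f"{label}: {node}")
```

For this hypothetical 40 GiB dataset, the doubling case needs a node one size larger, which doubles the hourly price.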

The last crucial aspect to consider is the frequency and retention policy for data store snapshots. In most cases, keeping only the latest snapshot is sufficient. However, if there is one area where you might consider going wild, this is it, since disk costs are much lower than memory costs.

To Conclude

The cost structure of your in-memory data store can vary significantly. While the direct cost of hardware is an important factor, you must also consider management costs and the cost of failure. By making informed choices and optimizing your configuration, you can not only reduce the cost of your in-memory data store by an order of magnitude but also ensure greater peace of mind, knowing your system is both cost-effective and reliable.

¹ Prices are based on AWS US East (N. Virginia) as of May 2024. For ElastiCache reserved node price calculations, we assume all-upfront payment for a 1-year term. For ElastiCache Serverless price calculations, we first convert GiB to GB and then multiply by $0.125 per GB-hour. We exclude ECPU costs and consider only memory costs.