
Announcing Dragonfly SSD Data Tiering: Cost-Effective Scaling for Massive Workloads

Dragonfly Data Tiering extends RAM with SSDs for massive, cost-effective datasets. Deliver high performance for scaled workloads at a fraction of the cost.

Announcing Dragonfly SSD Data Tiering | Cover Image

Background

Modern in-memory databases are commonly used as caches and are known for their high throughput and low latency compared to their persistent counterparts. However, their capacity is usually limited by their high cost, as RAM is simply more expensive. But what if they don’t always have to be fully in-memory to achieve almost the same performance?

In recent years, SSD prices have decreased faster than those of other storage types. What's more, the latest SSD drives feature read speeds of up to 3.5GB/s, making them at least 10 to 20 times faster than hard disk drives. It should come as no surprise, then, that a natural solution is to extend RAM capacity with SSD storage for datasets that exceed memory limits. Several systems offering this capability have emerged in recent years. Notable examples include the open-source Memcached/Extstore, released in 2018, and the proprietary ElastiCache Data Tiering, introduced in 2022.

With the latest Dragonfly releases (v1.35.0, v1.35.1), we are advancing this hybrid architecture with a novel approach. This blog post will dive into the design of Dragonfly Data Tiering and contrast it with the implementations of these existing systems.


Larger-Than-Memory Data Stores

Unlike persistent databases, the goal of larger-than-memory data stores (or data tiered stores) is to offer a cost-efficient system that intelligently and seamlessly utilizes both RAM and SSD storage. Such stores aim to deliver performance close to that of pure in-memory systems while maintaining compatibility with the original data store API. The core challenge lies in developing algorithms and methodologies that can deliver performance metrics (throughput and latency) comparable to those of in-memory stores. This requires careful consideration of data placement, access patterns, and eviction strategies.

The basic mechanism can be described in just a single paragraph, but the details are where it gets interesting. Each thread in Dragonfly's shared-nothing architecture manages its own preallocated file as contiguous storage space for entries. Most importantly, this file stores only the entry data (i.e., the values), while all metadata, including the entry key, expiry, and additional flags, remains in the in-memory global hash table. If an entry is stored on disk, the hash table simply keeps its location inside the file. The layout of entries inside the file is managed by a special bookkeeper, similar to an allocator handing out contiguous memory regions to a program.
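To make the split concrete, here is a minimal sketch (our own illustration, not Dragonfly's actual C++ internals) of a hash table that always keeps metadata in RAM while a value lives either inline or as a byte span inside the tiered file:

```python
# Sketch of the metadata/value split: the table entry always carries the
# key's metadata, while the value is either in-memory bytes or a file span.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DiskSpan:
    offset: int   # byte offset inside the per-thread tiered file
    length: int   # length of the serialized value

@dataclass
class Entry:
    expiry: Optional[float]   # metadata always stays in RAM
    flags: int
    value: Optional[bytes]    # set while the value is in memory
    span: Optional[DiskSpan]  # set while the value is on disk

table: dict[str, Entry] = {}

# A hot entry lives fully in RAM:
table["hot"] = Entry(expiry=None, flags=0, value=b"payload", span=None)
# A cold entry keeps only its location; the bytes live in the file:
table["cold"] = Entry(expiry=None, flags=0, value=None, span=DiskSpan(4096, 512))
```

Because the key and metadata never leave RAM, existence checks, TTL handling, and lookups all proceed without touching the disk; only fetching a cold value requires I/O.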

Dragonfly SSD Data Tiering | High Level Overview

This preservation of essential metadata in main memory enables Dragonfly to plan and execute I/O-bound operations with high efficiency. When a read request arrives (assuming the specific value is on disk), the system already knows the exact location of the value, requiring only a single disk read to retrieve it. Furthermore, concurrent reads for the same entry trigger only a single disk operation, and deletions require no disk I/O at all: the space is simply marked as free in the bookkeeper.
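The "deletion is free" property follows directly from the bookkeeper's allocator-like role. A toy sketch under our own simplifications (a first-fit free list over one file; the real bookkeeper is certainly more sophisticated):

```python
# Toy bookkeeper: hands out byte ranges inside the tiered file and, on
# delete, just records the hole again. No disk I/O is involved either way.
class Bookkeeper:
    def __init__(self, capacity):
        self.free = [(0, capacity)]   # list of (offset, size) holes

    def allocate(self, size):
        for i, (off, sz) in enumerate(self.free):
            if sz >= size:
                # carve the request out of the first hole that fits
                if sz == size:
                    self.free.pop(i)
                else:
                    self.free[i] = (off + size, sz - size)
                return off
        raise MemoryError("tiered file is full")

    def release(self, offset, size):
        # deletion is pure bookkeeping: remember the hole for reuse
        self.free.append((offset, size))

bk = Bookkeeper(capacity=1 << 20)
a = bk.allocate(4096)    # first value lands at offset 0
b = bk.allocate(4096)    # next one right after it
bk.release(a, 4096)      # deleting the first entry touches no disk at all
```

The released range becomes immediately reusable for future writes, which is why deletes cost nothing beyond an in-memory update.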

This efficient orchestration of operations is supercharged by Dragonfly’s use of io_uring for all its I/O operations, including disk access. By leveraging this modern Linux interface, Dragonfly executes I/O in a truly asynchronous manner. Issuing a high volume of operations concurrently allows it to fully saturate an SSD’s bandwidth, achieving optimal throughput.


Diving Deeper

Picking the Right Strategy

Dragonfly Data Tiering intelligently manages data across two layers to balance speed and capacity. Its core principle is simple and similar to what your server applications would do: keep the most frequently accessed “hot” entries in memory, while moving less frequently accessed “cold” entries to disk.

When enabled, a special background process continuously identifies candidate entries for offloading (storing on disk and evicting from RAM) by marking them. Any subsequent access to an entry clears this mark, ensuring that only data untouched across consecutive iterations of the background job is demoted from RAM to disk. If a cold entry on disk is read again and free memory is available, it is automatically uploaded (or promoted) back to RAM.
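This mark-and-clear cycle resembles a classic second-chance scheme. A hedged sketch of the idea (our simplification, not Dragonfly's actual scheduler):

```python
# Background pass: entries still marked from the previous pass were untouched
# for a full cycle and become offload candidates; everything is then re-marked.
def background_pass(entries, marked):
    candidates = [k for k in entries if k in marked]  # untouched a full cycle
    marked.clear()
    marked.update(entries)       # mark everything for the next cycle
    return candidates

entries = {"a", "b", "c"}
marked = set()
background_pass(entries, marked)   # first pass: everything gets marked
marked.discard("b")                # "b" is accessed, clearing its mark
cold = background_pass(entries, marked)
# "a" and "c" kept their marks, so they are demoted from RAM to disk
```

Accessed entries thus always survive at least one more full cycle in RAM, while genuinely idle ones drift to disk.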

This leaves a crucial middle ground: what about newly added entries? They are initially written directly to disk by default. However, if a new value is read immediately after being written, which is a common pattern in service-to-service communication, it would be slow to fetch it from disk. Dragonfly solves this by keeping a copy of the entry both on disk and in RAM, designating such entries as "cooled," since they are neither purely hot (only in RAM) nor purely cold (only on disk).

This “cooled” state makes the entry instantly available for subsequent reads from memory, while the primary copy remains safely on disk. This means the in-memory copy can be discarded at any moment to free up space without waiting for a slow disk write. This elegant solution allows Dragonfly to gracefully handle write traffic spikes while making highly efficient use of available memory.
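The three states and the cheap eviction they enable can be modeled in a few lines (state names and transitions are our illustration of the description above, not Dragonfly's internal enum):

```python
# HOT = RAM only, COLD = disk only, COOLED = a copy in both places.
# The COOLED state is what makes freeing memory instant: the disk copy
# is primary, so the RAM copy can simply be dropped.
from enum import Enum

class State(Enum):
    HOT = "ram only"
    COOLED = "ram + disk"
    COLD = "disk only"

def free_ram_copy(state):
    if state is State.COOLED:
        return State.COLD      # instant: no disk write needed
    if state is State.HOT:
        return State.COOLED    # must first stash the value to disk
    return state               # already cold, nothing in RAM to free

s = State.COOLED
s = free_ram_copy(s)   # memory pressure: drop the RAM copy, no disk write
```

A hot entry, by contrast, would first need a (slow) disk write before its memory could be reclaimed, which is exactly the cost the cooled state avoids during write spikes.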

Dragonfly SSD Data Tiering | Cooled Entries

Bypassing the Page Cache

The Linux page cache, which typically speeds up file reads by caching frequently accessed data, becomes redundant and even counterproductive when used with Dragonfly Data Tiering. Since Dragonfly performs its own caching strategy (the uploading/offloading discussed above), the page cache needlessly competes with it for valuable memory resources. This duplication leads to excessive data copying that slows down operations, and every byte of RAM consumed by the page cache could be better used to hold more in-memory entries in Dragonfly.

To eliminate this overhead and gain finer control over I/O, Dragonfly bypasses the page cache entirely by using direct I/O (via the O_DIRECT flag). This approach requires aligning all operations to page boundaries. Furthermore, Dragonfly uses "registered buffers" with io_uring, giving the kernel direct access to preallocated user memory regions from which data is written and into which it is read, avoiding redundant copies.
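The alignment requirement means a value's byte range must be expanded to block boundaries before a direct read can be issued. A small sketch of the arithmetic involved (assuming a 4KB block size):

```python
# Direct I/O (O_DIRECT) requires the file offset and length to be multiples
# of the block size. Round a value's range out to the aligned range to read.
BLOCK = 4096

def align_down(x):
    return x & ~(BLOCK - 1)

def align_up(x):
    return (x + BLOCK - 1) & ~(BLOCK - 1)

def direct_read_range(value_offset, value_len):
    """Expand a value's byte range to the aligned range a direct read needs."""
    start = align_down(value_offset)
    end = align_up(value_offset + value_len)
    return start, end - start   # aligned offset, aligned length

# A 100-byte value stored at offset 5000 forces one full aligned 4KB read:
off, length = direct_read_range(5000, 100)   # -> (4096, 4096)
```

The value's exact position inside the aligned block is then recovered from the original offset after the read completes.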

Dragonfly SSD Data Tiering | Linux Page Cache

Small Values and Disk IOPS Limits

Dragonfly can offload values as small as 64 bytes, but as mentioned above, reads and writes with direct I/O have to be aligned to 4KB boundaries. Of course, each small value could be placed on its own 4KB block. However, this would cause very high write amplification, wasting precious disk IOPS and space. Therefore, small entries are grouped together within a single page. A special middle layer glues them together to reach an acceptable total size before passing them on to disk storage.
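The middle layer's job can be sketched as a simple batcher (our own toy model, ignoring serialization details) that accumulates small records and flushes a page once it is full:

```python
# Group small (key, value) records into 4KB pages before writing, instead of
# spending one aligned disk write per record.
PAGE = 4096

class PageBatcher:
    def __init__(self):
        self.pending = []        # records waiting for a full page
        self.pending_bytes = 0
        self.flushed_pages = []  # stand-in for actual disk writes

    def add(self, key, value):
        item_size = len(key) + len(value)  # keys are stored alongside values
        if self.pending_bytes + item_size > PAGE:
            self.flush()
        self.pending.append((key, value))
        self.pending_bytes += item_size

    def flush(self):
        if self.pending:
            self.flushed_pages.append(self.pending)
            self.pending, self.pending_bytes = [], 0

b = PageBatcher()
for i in range(100):
    b.add(f"key:{i}", b"x" * 120)   # ~126-byte records
b.flush()
# 100 small records pack into 4 pages rather than 100 separate 4KB blocks
```

Instead of 100 aligned writes, the disk sees only a handful of full pages, which is what keeps IOPS and space amplification under control for small values.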

Yet, over time, the deletion of many small values can leave disk blocks fragmented and mostly empty. To solve this, we designed a clever compaction algorithm. This process gradually scans fragmented on-disk pages, identifies the valid data within them, and rewrites that data into new, densely packed pages.

Dragonfly SSD Data Tiering | Page Defragmentation

The key to this efficiency is that Dragonfly stores entry keys alongside the values on disk. This simple design allows the system to trace any value back to its entry in the in-memory hash table using only the file offset, enabling this compaction without relying on memory-hungry reverse maps or complex metadata. All in all, the write and space amplification factor is kept below 2 for all values.
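The compaction pass itself reduces to collecting live records from sparse pages and repacking them densely. A toy sketch (page capacity and record layout are our simplifications):

```python
# Compaction sketch: scan fragmented pages, keep only live (key, value)
# records, and rewrite them densely into fresh pages. Storing the key next
# to the value is what lets each record be traced back to the hash table.
def compact(pages, live_keys):
    survivors = [rec for page in pages for rec in page if rec[0] in live_keys]
    new_pages, page = [], []
    for rec in survivors:
        page.append(rec)
        if len(page) == 4:          # pretend 4 records fill a page
            new_pages.append(page)
            page = []
    if page:
        new_pages.append(page)
    return new_pages

# Three mostly-empty pages collapse into one densely packed page:
old = [[("a", b"1"), ("x", b"?")], [("b", b"2")], [("c", b"3"), ("y", b"?")]]
packed = compact(old, live_keys={"a", "b", "c"})
```

After rewriting, the hash table entries for `a`, `b`, and `c` would be updated to their new offsets, and the old pages returned to the bookkeeper as free space.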


Dragonfly Data Tiering in Action

Tiered mode is currently supported only for string values. To run Dragonfly with data tiering, use the --tiered_prefix parameter, pointing to a specific file name prefix. Dragonfly will create one file per thread, each with a starting size of 256MB. For example:

$> ./dragonfly --maxmemory=20G \
               --tiered_prefix=/mnt/fast-ssd/dragonfly-tiered-file \
               --tiered_offload_threshold=0.2

The command above runs Dragonfly with a maximum allowed memory of 20GB and an offload threshold of 20%. This means that when less than 20% of memory remains unoccupied (i.e., more than 80% memory usage), Dragonfly will start offloading values to disk more aggressively, throttling incoming writes and moving items to disk in the background.

The TIERED subsection of the INFO command provides a wide range of different metrics. Let’s look at the most basic ones:

  • tiered_entries: The number of value entries offloaded to disk.
  • tiered_entries_bytes: The amount of data (in bytes) offloaded to disk.
  • tiered_pending_read_cnt: The number of disk reads that are currently pending.
  • tiered_pending_stash_cnt: The number of disk writes that are currently pending.
  • tiered_allocated_bytes: The number of reserved bytes of disk space.
  • tiered_capacity_bytes: The total capacity of disk space.
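These metrics arrive as plain `name:value` lines in the INFO reply, so they are easy to feed into a monitoring pipeline. A small helper of our own (the sample values below are made up for illustration):

```python
# Parse the TIERED section of a raw INFO reply into a dict of integers.
SAMPLE_INFO = """\
# TIERED
tiered_entries:120000
tiered_entries_bytes:480000000
tiered_pending_read_cnt:3
tiered_pending_stash_cnt:7
tiered_allocated_bytes:536870912
tiered_capacity_bytes:1073741824
"""

def parse_tiered(info_text):
    metrics = {}
    for line in info_text.splitlines():
        if line.startswith("tiered_"):
            name, _, value = line.partition(":")
            metrics[name] = int(value)
    return metrics

m = parse_tiered(SAMPLE_INFO)
# e.g. derive disk utilization from the allocated and capacity counters:
disk_usage = m["tiered_allocated_bytes"] / m["tiered_capacity_bytes"]
```

With a Redis-compatible client, the same text is returned by `INFO` and the derived ratio makes a natural alerting signal for a filling SSD tier.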

Benchmarking

Having explored the architecture that makes data tiering possible, the critical question remains: how does it perform? To find out, we benchmarked Dragonfly against ElastiCache for Valkey on AWS r6gd.2xlarge instances for read-only workloads. For the benchmark, we configured both data stores with a 40GB max memory limit and data tiering enabled.

Before sending the read commands for benchmarking, the instances were filled with 40M values of 4000 bytes each, totalling 160GB of data and ensuring the dataset would exceed the configured capacity and actively utilize the SSD tier. For the read phase, keys were selected either uniformly or from a Zipfian distribution with an alpha factor of 0.9, allowing us to test both uniform key access and controlled hit rates. This client-side hit rate simulates the experience of a server-side application attempting to read a key, regardless of whether the data is served from memory or the SSD tier.
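For readers unfamiliar with Zipfian workloads, here is a sketch of how a benchmark client might draw skewed keys with alpha = 0.9 (our own implementation, not the actual benchmark tool's code):

```python
# Draw key ranks from a Zipfian distribution: rank r is chosen with
# probability proportional to 1 / r**alpha, skewing reads toward hot keys.
import random

def zipf_sampler(n_keys, alpha, seed=42):
    rng = random.Random(seed)
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_keys + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)

    def sample():
        u = rng.random()
        lo, hi = 0, n_keys - 1
        while lo < hi:               # binary search the CDF
            mid = (lo + hi) // 2
            if cdf[mid] < u:
                lo = mid + 1
            else:
                hi = mid
        return lo                    # key rank, 0 = hottest key
    return sample

sample = zipf_sampler(n_keys=1000, alpha=0.9)
hits = sum(1 for _ in range(10_000) if sample() < 100)
# with alpha = 0.9, the top 10% of keys absorb well over half of the accesses
```

This skew is what pushes the hit rate up in the second benchmark: most reads land on a small hot set that tends to stay in RAM.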

Uniform Key Access

In the first benchmark, keys were randomly selected from the full key range, resulting in a hit rate of about 50%. The benchmark was run with varying numbers of connections. As the number of connections grows, the throughput of ElastiCache tops out at 287k reads/s, while Dragonfly reaches 430k reads/s. Both data stores start with a P99 latency of about 1 millisecond, increasing to 2 and 4 milliseconds for Dragonfly and ElastiCache, respectively. We can conclude that Dragonfly achieves 1.5x the read throughput of ElastiCache while also providing lower P99 latency under load.

Dragonfly SSD Data Tiering | Uniform Key Access Throughput (Higher Is Better)
Dragonfly SSD Data Tiering | Uniform Key Access Latency (Lower Is Better)

Zipfian Key Access

This benchmark is analogous to the previous one but employs a Zipfian key distribution to create a high client-side hit rate of approximately 86%. With more requests now targeting valid keys, both systems must handle a greater proportion of successful reads, fetching data from both memory and the tiered storage. We can see that Dragonfly peaks at 360k reads/s, exceeding ElastiCache's throughput by more than 100k reads/s. Under heavy load, Dragonfly's P99 latency is consistently lower as well. This latency advantage reverses only under very low client concurrency, where Dragonfly shows slightly higher latency.

Dragonfly SSD Data Tiering | Zipfian Key Access Throughput (Higher Is Better)
Dragonfly SSD Data Tiering | Zipfian Key Access Latency (Lower Is Better)

Try It Today

Dragonfly’s SSD Data Tiering extends the data store’s high-performance, in-memory core to efficiently manage massive datasets that exceed RAM capacity. By intelligently leveraging SSDs, it offers a dramatic reduction in storage cost while preserving the low latency and high throughput that define the Dragonfly experience. If your use case involves large volumes of data with varied access patterns, we encourage you to try this feature and see how it can lower your infrastructure costs without compromising performance.

We are incredibly excited about this new capability and can’t wait for you to experience the power of massive, cost-effective tiered storage. To get started, head over to our documentation. Give it a try and let us know what you build.
