Why Threading Models Matter: Dragonfly Pulls Ahead of Valkey in CPU-Intensive Workloads

Valkey’s single-threaded engine can quickly hit the ceiling. Discover how Dragonfly’s threading model scales linearly for CPU-intensive workloads.

November 4, 2025

Valkey 8.0 introduced enhanced asynchronous I/O threading, marking a big step forward in performance and scalability. By allowing network reads and writes to run in different threads, Valkey can significantly boost throughput on modern multi-core machines. This evolution mirrors the path Redis took a few years earlier: Redis 6.0 first added basic I/O threading, and Redis 8.0 refined it further. Both systems now benefit from parallel I/O handling that improves throughput for lightweight operations. For a comparison between Valkey and Redis, check out our previous article.

But while these changes improved I/O efficiency, the core architecture of both Valkey and Redis remains fundamentally single-threaded. Data operations, like command execution or script evaluation, still run on one main thread. That design preserves atomicity and simplicity, but it also means CPU-intensive workloads can’t scale with additional cores. In practice, it becomes a major bottleneck for use cases like sorted sets or search workloads, where the data structure must constantly maintain ordering as new items are inserted.

In this article, we’ll examine how Valkey’s threading model compares to Dragonfly’s native multi-threaded architecture and why the difference matters. Using sorted set operations as an example, we’ll also look at how each system handles CPU-heavy workloads with some benchmark results.

How Valkey’s Threading Model Works

Valkey’s architecture inherits from the classic Redis design: a single-threaded command execution model paired with multi-threaded network I/O. The I/O layer, which is responsible for reading and writing data over client connections, can now use multiple threads to handle large numbers of simultaneous requests. This was a big step forward in Valkey 8.0, allowing much better use of modern multi-core CPUs for network-bound workloads.
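
You can check how this layer is configured on a running server with the standard CONFIG command (a minimal sketch, assuming valkey-cli can reach the instance):

# Inspect the number of I/O threads on a live Valkey server.
# A value of 1 means I/O threading is effectively disabled.
$> valkey-cli CONFIG GET io-threads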

However, once data reaches the core engine, all command execution still happens on a single main thread. This design choice is deliberate, as it guarantees atomicity, meaning each command runs to completion before the next one begins. Atomicity ensures correctness and eliminates the need for complex locking mechanisms between threads. For example, if two clients issue the command INCR counter at the same time, Valkey processes one completely before starting the next. The counter will increment cleanly from 0 → 1 → 2, never skipping or overlapping values. This simplicity is one of the main reasons Redis and Valkey have remained so stable and predictable under load.
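
As a minimal illustration of this guarantee (assuming a local Valkey instance with valkey-cli on the PATH):

# Two clients racing on the same counter: each INCR runs to
# completion before the next begins, so the value always advances
# cleanly from 0 to 1 to 2.
$> valkey-cli SET counter 0
$> valkey-cli INCR counter   # client A observes 1
$> valkey-cli INCR counter   # client B observes 2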

The trade-off is that data operations remain single-threaded. Tasks like adding items into sorted sets or running expensive Lua scripts all execute on that single core. While I/O threading can speed up request handling and improve throughput for lightweight workloads, CPU-intensive commands still hit a ceiling as they can’t take advantage of multiple cores. In practical terms, this means that no matter how many CPUs you give Valkey, the actual data manipulation step will always run on just one of them. This creates a significant bottleneck for compute-heavy use cases.
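
One way to feel this ceiling (a sketch, assuming a local instance) is a deliberately CPU-bound Lua script: while it loops, the single execution thread is busy and every other command queues behind it, regardless of how many I/O threads are configured:

# This script pins the one execution thread; all other commands
# stall until it returns, no matter how many cores the machine has.
$> valkey-cli EVAL "local x = 0 for i = 1, 100000000 do x = x + i end return x" 0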

Valkey vs. Dragonfly Architecture

Dragonfly: Fully Multi-Threaded

Unlike Valkey and Redis, which evolved from single-threaded engines, Dragonfly was built from the ground up for modern infrastructure with multi-threaded data execution in mind. Its architecture takes full advantage of multi-core hardware by distributing data across independent shards, with each CPU core responsible for its own subset of keys. There is no global lock and no single main thread coordinating all data operations; each thread runs its own event loop and processes commands in parallel. This design allows Dragonfly to maintain atomicity without falling back to single-threaded execution. For multi-key operations (e.g., MSET) where keys may be spread across multiple shards, Dragonfly uses a lightweight transactional framework to ensure atomicity efficiently instead of relying on a global lock to serialize operations. Because of this architecture, CPU-intensive workloads scale linearly with available cores, achieving true parallelism that Valkey’s single-threaded data model simply can’t match, no matter how many I/O threads it adds.
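
For example (a sketch, assuming a local Dragonfly instance; redis-cli works unchanged since Dragonfly speaks the same protocol), a multi-key write remains atomic even when its keys hash to different shards:

# These keys will typically land on different shards, yet MSET is
# applied as a single atomic transaction, not behind a global lock.
$> redis-cli MSET user:1:name alice user:2:name bob user:3:name carol
# Readers observe all three values or none mid-write, never a partial set.
$> redis-cli MGET user:1:name user:2:name user:3:name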


Sorted Set: A CPU-Intensive Data Structure

Certain operations in in-memory data stores are naturally CPU-intensive. Sorted sets, Lua scripting, and search or geospatial queries all require the engine to perform continuous sorting and reordering of elements as new data arrives. Geospatial indexes, for instance, are internally based on sorted sets, which means every coordinate update triggers the same ordering logic. These workloads aren’t limited by how fast data can move over the network but rather by how fast the server can perform computations on the data itself.
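
You can observe the geospatial-to-sorted-set relationship directly on any Redis-compatible server (a minimal sketch; the key and member names are illustrative):

# GEOADD stores coordinates as sorted set members whose scores are
# 52-bit geohash encodings, so the key reports its type as "zset".
$> redis-cli GEOADD fleet -122.27 37.80 truck:1
$> redis-cli TYPE fleet            # returns "zset"
$> redis-cli ZSCORE fleet truck:1  # the geohash-derived score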

Even with Valkey 8.1’s improved multi-threaded I/O, these core data operations still execute on a single thread. The result is that I/O becomes faster, but CPU-heavy workloads don’t scale. As demonstrated in this blog post, Valkey 8.1 achieved impressive memory efficiency, about 22-27% lower memory usage than Valkey 8.0, but only modest throughput gains of around 7-8% for sorted sets. Those numbers confirm that while Valkey’s I/O layer can now use multiple cores, its command execution path cannot.

Dragonfly, on the other hand, approaches the same problem differently. Its B+ tree-based sorted set implementation drastically reduces memory overhead. More importantly, performance also scales effectively across cores for sorted set workloads, which are extremely CPU-intensive.

Memory Efficiency: Dragonfly’s Persistent Advantage

To validate these architectural claims in practice, we aimed to reproduce the benchmark on Dragonfly using the same tool. However, we noted that the workload used in the benchmark tool (at the time of writing, it adds items with increasing scores to a single sorted set key) doesn’t reflect a practical production scenario, where item insertions are non-sequential and data is distributed across many keys. That said, we proceeded with the test to confirm we could replicate the baseline numbers, at least for memory consumption, before moving on to a more realistic workload.

# Download and build the zset_bench tool.
$> git clone git@github.com:momentohq/sorted-set-benchmark.git
$> cd sorted-set-benchmark
$> cargo build --release

# Run the zset_bench tool against both Valkey and Dragonfly.
# Since Dragonfly is Redis-compatible, it is passed via the --redis flag.
# Machine specs do not matter much here, as long as the instance can hold ~8GB of data for each engine.
$> ./target/release/zset_bench --valkey valkey_host --redis dragonfly_host
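
Alongside the tool’s own output, memory consumption can be sampled directly from each server during the run (a sketch using the standard INFO command; hostnames are placeholders):

# Record the current memory footprint on either engine at each step.
$> redis-cli -h valkey_host INFO memory | grep used_memory_human
$> redis-cli -h dragonfly_host INFO memory | grep used_memory_human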

Using the zset_bench tool, we benchmarked both Valkey and Dragonfly under the same conditions: adding 50 million items to a single sorted set with a pipeline of 1000 commands. The results below capture the memory usage, recorded at each 5-million-item step.

Dragonfly v1.34.2 vs. Valkey v9.0.0 | Sorted Set Memory Usage

As the results show, Dragonfly maintained a 25-40% lower memory footprint throughout the test, ending at 2.72GB vs. 3.77GB for Valkey at 50M items. While Valkey has made impressive strides in memory efficiency with its recent improvements, Dragonfly’s underlying data structure preserves its advantage. The difference stems from the core data structures: where Valkey’s skiplist carries an overhead of about 37 bytes per item, Dragonfly’s B+ tree implementation slashes this to just 2-3 bytes, accounting for the significant and consistent memory savings.
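
To spot-check the per-key footprint on your own data, the standard MEMORY USAGE command works on both engines (a sketch; the key name is illustrative, and the reported number includes the key’s full overhead rather than a pure per-item cost):

# Compare the memory attributed to the same sorted set key.
$> redis-cli -h valkey_host MEMORY USAGE myzset
$> redis-cli -h dragonfly_host MEMORY USAGE myzset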

Throughput Performance: Dragonfly Scales Linearly with Cores

To measure throughput under a more realistic workload, we used the dfly_bench tool, which is source-available and part of Dragonfly’s releases. This tool allows us to easily generate varied and practical traffic patterns. For a clear comparison, we replicated the environment from the Valkey blog post:

  • Client & Server:
    • Two AWS c8g.2xlarge instances (Graviton4, 8 vCPU, 16GB RAM), one for dfly_bench and one for Valkey/Dragonfly.
    • Valkey and Dragonfly were tested separately.
  • Server Configurations:
    • Valkey: Tested with io-threads disabled and with io-threads=6, the latter being Valkey’s recommended setting for an 8-core machine.
    • Dragonfly: Uses 8 threads by default on an 8-core machine, leveraging its inherent architectural advantage.
  • Workload: Sent 50 million ZADD requests with a pipeline of 1000 commands, using --d=36 to simulate storing 36-byte values like UUIDs. Scores were randomized as well.

Server Configs

# Valkey default configs with persistence disabled.
$> ./valkey-server --save '' --appendonly no --protected-mode no

# Valkey default configs with persistence disabled, 6 I/O threads.
$> ./valkey-server --save '' --appendonly no --protected-mode no --io-threads 6

# Dragonfly default configs with persistence disabled.
$> ./dragonfly --dbfilename=

Client Configs

# A total of 50 million ZADD commands with 8 client threads
# sending in parallel: 6,250,000 commands per client.
#
# The sorted set element value size is 36 bytes (simulating string-form UUIDs).
# In practice, consider storing UUIDs in their 16-byte binary form where applicable.
#
# The sorted set scores are randomized.
# The total number of possible sorted set keys is 100.
# Send without throttling by specifying '--qps=0'.
$> ./dfly_bench --command "ZADD __key__ NX __score__ __data__" \
                --d=36 --c=1 --qps=0 --pipeline=1000 --n=6250000 \
                --proactor_threads=8 --key_maximum=100 --h=host --p=port

The results highlight a fundamental architectural divergence:

Dragonfly v1.34.2 vs. Valkey v9.0.0 | Sorted Set Throughput

The data tells a clear story. Enabling I/O threads in Valkey provides a minor ~5% throughput lift. This is precisely what we expect with a CPU-bound workload: while I/O threads help with network handling and command parsing, the sorted set operations themselves are still executed by a single main thread, which becomes the bottleneck.

In contrast, Dragonfly’s multi-threaded architecture allows it to distribute the computational load of these ZADD operations across all available CPU cores. The result is a throughput of over 1.1 million QPS, roughly 7.3x higher than Valkey’s best configuration on the same 8-core hardware. This is not a minor tuning improvement but a direct consequence of a design built for modern, multi-core systems. We’ve previously tested older versions of Dragonfly against Valkey as well, reaching 29x higher throughput on a 48-vCPU GCP C4 instance. The story remains consistent: Dragonfly’s performance scales with available CPU cores, while Valkey’s is ultimately limited by a single core.

| Sorted Set                | Valkey 9.0.0                                        | Dragonfly 1.34.2                                     |
|---------------------------|-----------------------------------------------------|------------------------------------------------------|
| Threading Model           | Single-threaded data execution, multi-threaded I/O  | Fully multi-threaded, thread-per-core data sharding  |
| Underlying Data Structure | Skiplist                                            | B+ tree                                              |
| Throughput                | ~7–8% higher vs. Valkey 8.0                         | More than 7x vs. Valkey 9.0 on AWS c8g.2xlarge       |
| Memory Consumption        | ~22–27% lower vs. Valkey 8.0                        | 25–40% lower vs. Valkey 9.0                          |
| Scalability               | Limited by single-core performance                  | Scales linearly with cores for CPU-heavy workloads   |

To sum up, the benchmark results clearly illustrate the performance profile of each system for CPU-intensive sorted set workloads. Valkey demonstrates incremental improvements within its single-threaded design, achieving modest gains in memory efficiency and minor gains in throughput compared to its previous versions.

However, Dragonfly’s architectural choices deliver a decisive advantage. Its B+ tree implementation provides significantly lower memory consumption, while its fully multi-threaded, shared-nothing architecture unlocks massive throughput scalability.


Threading Models Matter for Real-World Workloads

Threading architecture isn’t just an implementation detail; it defines how an in-memory data store behaves under real-world pressure. For cache-heavy workloads like simple GET and SET operations, Valkey can perform decently, as its multi-threaded I/O layer helps it handle high connection counts.

But as workloads grow more compute-intensive (e.g., sorted sets, search operations), that single-threaded design quickly becomes a ceiling. Because command execution is tied to one core, Valkey (and Redis) can’t parallelize complex data operations, no matter how many CPUs are available. Adding more hardware improves I/O throughput, but not the actual speed of computation.

Dragonfly, however, sidesteps this bottleneck entirely. Its architectural and data structure advantages enable parallel execution while maintaining atomicity, resulting in consistent, linear scaling: as you add more CPU cores, Dragonfly’s throughput rises accordingly.
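
In practical terms (a sketch; --proactor_threads is the same flag used for dfly_bench above and defaults to all available cores), you control how many cores Dragonfly uses, and CPU-heavy throughput should track that number:

# Run Dragonfly on 16 cores; omit the flag to use every core.
$> ./dragonfly --dbfilename= --proactor_threads=16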

For organizations running heavy in-memory workloads, particularly in today’s AI-driven landscape where context-rich operations demand more data and compute per user request, this scaling capability translates directly into better performance, lower latency, and far more efficient hardware utilization. When every CPU core counts, the threading model of your data store ultimately determines how far your application can go.
