Dragonfly vs. Valkey 9.0 on Graviton: Head-to-Head

Valkey 9.0 shipped, and the release notes make a real claim: significantly improved I/O throughput. The last time we published a full Dragonfly vs. Valkey comparison was on GCP against Valkey 8.0.2, so the numbers from that post are no longer the right reference. We owed both communities a fresh look.

This post is that look. We ran Dragonfly and Valkey 9.0 against each other on AWS Graviton instances, with identical hardware, identical load generators, identical network tuning, and we are publishing every command. The goal is reproducibility and an honest read of where each system stands today.

The short version: Valkey 9.0 is a meaningful step up over Valkey 8.x on the I/O path, and the gap on read-heavy workloads has narrowed. On writes, on vertical scaling, on tail latency, and on memory efficiency, Dragonfly's architecture continues to pull ahead — in some cases by an order of magnitude.

Why we re-ran this benchmark

Two things changed since our last post.

First, Valkey 9.0 landed with a redesigned I/O path. The team's work to push network reads and writes off the main thread is real engineering, and benchmarks against Valkey 8 are no longer the right baseline. If we are going to talk about Valkey performance in 2026, we should be talking about 9.0.

Second, AWS released new-generation instances that we had not benchmarked yet, and we wanted to use the opportunity to cover a wider range of sizes. Vertical scaling has always been the more interesting question to us, because the operational cost of a single big box differs enormously from that of a cluster of small ones, and most teams discover this only after they have committed to one direction.

So we ran the same workloads on three instance sizes and measured throughput, full latency histograms, and per-entry memory cost. Then we wrote it up without dressing up the parts that go against us.

Test setup

Hardware

Servers: m7g.2xlarge (8 vCPU), m7g.4xlarge (16 vCPU), and c7gn.metal (64 vCPU, full socket Graviton).
Clients: c7gn.4xlarge for the m7g tests, and a c7gn.metal client paired with the c7gn.metal server for the high-end run. The client machine is always at least as large as the server, so that the load generator is not the bottleneck.
OS: Ubuntu 24.04.2 on both client and server.
Network: we tuned SMP affinity for NIC IRQs to spread the interrupt load across multiple CPUs. At the throughputs we are pushing — well above 500K QPS in many cases — default IRQ pinning becomes a bottleneck before either data store does, and a benchmark that does not address this is measuring the kernel, not the database.
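
For readers who have not tuned IRQ affinity before, here is a minimal sketch of one common approach on Linux: assign each NIC interrupt to a CPU in round-robin order by writing to /proc/irq/<n>/smp_affinity_list. This is an illustration, not the exact script used for these runs. The helper name plan_irq_affinity and the sample IRQ numbers are hypothetical, and the function only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch: print (do not execute) commands that spread NIC IRQs across CPUs.
# plan_irq_affinity is a hypothetical helper, not part of any tool in this post.
# Usage: plan_irq_affinity <num_cpus> <irq> [<irq> ...]
plan_irq_affinity() {
  ncpu=$1
  shift
  i=0
  for irq in "$@"; do
    cpu=$((i % ncpu))
    # smp_affinity_list accepts a CPU list such as "3" or "0-3".
    echo "echo $cpu > /proc/irq/$irq/smp_affinity_list"
    i=$((i + 1))
  done
}

# Example with made-up IRQ numbers. On a real host, read the NIC's IRQ
# numbers from /proc/interrupts, and stop irqbalance first so it does not
# overwrite the assignment.
plan_irq_affinity 4 101 102 103 104 105
```

With four CPUs and five IRQs, the fifth interrupt wraps around to CPU 0.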

Software

Dragonfly: latest from main.
Valkey: valkey/valkey:9.0 from the official Docker image.
Load generator: dfly_bench, the same tool we have used in every recent benchmark. It is source-available and uses runtime flags very similar to memtier_benchmark.

Server commands

Dragonfly:

./dragonfly --conn_use_incoming_cpu --dbfilename= --logtostderr --port 6380

Valkey:

docker run --network="host" --rm valkey/valkey:9.0 \
--save "" --appendonly no \
--io-threads 8 \
--protected-mode no --port 6380

A note on Valkey's io-threads. We tried raising this setting above 8 on the 16 vCPU box and saw no additional throughput. This is consistent with Valkey's architecture: I/O threading lifts network handling and parsing off the main thread, but command execution still runs on a single core. Past a certain point, more I/O threads cannot help, because the main thread is already saturated. We are not under-threading Valkey to make it look bad; we are reflecting where the engine actually plateaus.
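
The plateau described above is what Amdahl's law predicts for any design with a serial stage. The sketch below is a toy calculation, not a model fitted to Valkey: the serial fraction s = 0.4 is an illustrative assumption rather than a measurement, and amdahl is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: Amdahl's law, speedup(n) = 1 / (s + (1 - s) / n).
# s is the fraction of per-request work that stays on the main thread.
# The value 0.4 used below is an illustrative assumption, not a measurement.
amdahl() {
  awk -v s="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / (s + (1 - s) / n) }'
}

for n in 1 2 4 8 16 32; do
  echo "threads=$n speedup=$(amdahl 0.4 "$n")"
done
```

The curve flattens toward 1/s (2.5x for s = 0.4). In this toy model, going from 8 to 16 threads adds less than 10%, however many cores the box has.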

Client commands

Writes:

./dfly_bench -n $N -p 6380 --qps=0 -d 64 \
--key_maximum=$KMAX -c $CONN \
--command "setex __key__ 10000 __data__"

Reads:

./dfly_bench -n $N -p 6380 --qps=0 -d 64 \
--key_maximum=$KMAX -c $CONN --ratio 0:1

N, KMAX, and CONN were tuned per server to saturate the engine and to populate enough keys that the server's memory footprint reflected realistic load rather than a half-empty table. Exact values are in the appendix.
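
As an illustration of how the placeholders map to the appendix, the sketch below fills in the m7g.2xlarge write run (25 connections, 2,000,000 requests, 200M keys). The helper name write_cmd is hypothetical, and it prints the command instead of running it so the values can be checked first.

```shell
#!/bin/sh
# Sketch: build the write command for the m7g.2xlarge run from the
# appendix values (25 connections, 2,000,000 requests, 200M keys).
# write_cmd is a hypothetical helper; it prints the command, it does not run it.
N=2000000
CONN=25
KMAX=200000000

write_cmd() {
  echo "./dfly_bench -n $N -p 6380 --qps=0 -d 64" \
       "--key_maximum=$KMAX -c $CONN" \
       "--command \"setex __key__ 10000 __data__\""
}

write_cmd
```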

What we did not test

This benchmark covers string SET/GET workloads with 64-byte values, no persistence, no replication, and no cluster mode. A follow-up post might cover more complicated setups. We say this up front because a single benchmark cannot answer every question, and we would rather you generalize from these numbers conservatively than over-read them.

Results: m7g.2xlarge (8 vCPU)

This is the size where Valkey's I/O improvements should show up most clearly. Eight cores is comfortably within Valkey's I/O threading sweet spot.

Metric	Dragonfly	Valkey 9.0
Write QPS	816K	548K
Read QPS	848K	811K
Write P99 latency	787 µs	3,283 µs
Read P99 latency	887 µs	951 µs
Bytes per entry	127	149

Reads are close. Valkey 9.0 hits 811K QPS to Dragonfly's 848K — about a 5% gap. That is a real improvement over what Valkey 8 could do at this size, and it is the headline result of the Valkey team's I/O work. Credit where it is due.

Writes are a different picture. Dragonfly's write throughput is 1.5x higher (816K vs. 548K), and the write P99 is 4.2x lower (787 µs vs. 3,283 µs). The throughput gap is interesting on its own, but the latency gap is the operationally important number. If you have a write-path SLO measured in low milliseconds, the Valkey 9.0 distribution is going to spend a meaningful slice of its time outside that budget at this load.
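
The ratios quoted above can be rechecked directly from the table. This is a small sketch; ratio is a hypothetical helper that prints a/b to two decimals.

```shell
#!/bin/sh
# Sketch: recompute the ratios quoted above from the m7g.2xlarge table.
# ratio is a hypothetical helper that prints a/b to two decimals.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "write throughput: $(ratio 816 548)x"    # Dragonfly / Valkey, K QPS
echo "write P99:        $(ratio 3283 787)x"   # Valkey / Dragonfly, microseconds
echo "read throughput:  $(ratio 848 811)x"    # Dragonfly / Valkey, K QPS
```

This prints 1.49x, 4.17x, and 1.05x, which round to the 1.5x, 4.2x, and roughly 5% figures in the text.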

The other thing worth noting is the shape of Valkey's write performance over time. Throughput visibly degrades as the main hash table grows and rehashes, the same pattern we documented in our GCP benchmark against Valkey 8. Dragonfly's write throughput stays flat through the entire load phase. This is an architectural difference that stems from Dragonfly's different hash table implementation, and it shows up whenever the working set crosses the rehash threshold in Valkey.

Results: m7g.4xlarge (16 vCPU)

This is where the architectural difference starts to show.

Metric	Dragonfly	Valkey 9.0
Write QPS	1,563K	599K
Read QPS	1,581K	844K
Write P99 latency	609 µs	4,387 µs
Read P99 latency	784 µs	1,154 µs
Bytes per entry	127	150

Doubling cores from 8 to 16 nearly doubled Dragonfly's throughput on both reads and writes. Valkey gained roughly 9% on writes (548K to 599K) and 4% on reads (811K to 844K). This is exactly what you should expect from an engine where command execution is single-threaded: extra cores help with I/O up to a point, and after that they sit idle.
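
To make the scaling comparison concrete, the sketch below computes the 8-to-16 vCPU scaling factor for each engine from the two result tables. scale is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: scaling factor when going from 8 to 16 vCPUs, using the QPS
# figures (in thousands) from the two result tables.
# scale is a hypothetical helper: prints new/old and the percentage gain.
scale() {
  awk -v old="$1" -v new="$2" \
    'BEGIN { printf "%.2fx (+%.0f%%)\n", new / old, (new / old - 1) * 100 }'
}

echo "Dragonfly writes: $(scale 816 1563)"
echo "Dragonfly reads:  $(scale 848 1581)"
echo "Valkey writes:    $(scale 548 599)"
echo "Valkey reads:     $(scale 811 844)"
```

Perfect scaling would be 2.00x. Dragonfly lands at 1.92x and 1.86x, while Valkey lands at 1.09x and 1.04x.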

If you size for peak load by picking a bigger instance, Valkey returns less per dollar past the 8 vCPU mark. The write P99 picture also widens at this size: 609 µs on Dragonfly versus 4,387 µs on Valkey, a 7x gap.

Results: c7gn.metal — Dragonfly's ceiling

We did not run Valkey on c7gn.metal. There is no point: adding cores beyond the I/O threading ceiling does not change Valkey's numbers. We ran this configuration to answer a different question: how far does a single Dragonfly process go on hardware that actually scales?

Metric	Dragonfly
Write QPS	6,664K
Read QPS	6,663K
Bytes per entry	126

Over 6.6 million QPS on writes and over 6.6 million on reads, out of one process. The point is not that everyone needs this; most workloads do not. The point is that when you do need this kind of throughput, you can get there without running a cluster.

One honest caveat: AWS instance sizing is not perfectly linear. When we benchmarked m7g.8xlarge, we saw throughput numbers very close to m7g.4xlarge, suggesting we hit a network or memory bandwidth limit somewhere in the instance family. c7gn.metal cleared that ceiling due to its enhanced networking capabilities. When you benchmark on AWS, the instance family matters as much as the engine, and a 2x larger box does not always give you 2x more performance.

Memory efficiency

Per-entry memory overhead, measured on a fully populated keyspace with 64-byte values:

Dragonfly: ~127 bytes per entry
Valkey 9.0: ~149–150 bytes per entry

That is about 15% less RAM for the same dataset. On a 60 GB working set, that is roughly 9 GB you do not have to provision. The underlying reason is Dragonfly's dashtable layout, which has lower per-entry overhead than Valkey's hashtable plus dictEntry structure.
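
The arithmetic behind those two figures is sketched below. It assumes the per-entry ratio applies uniformly to the whole dataset and that the 60 GB is the footprint as stored by Valkey; saving_pct and saved_gb are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: memory saved per entry and on a dataset, from the measured
# bytes-per-entry figures. saving_pct and saved_gb are hypothetical helpers.
saving_pct() {
  awk -v df="$1" -v vk="$2" 'BEGIN { printf "%.1f\n", (vk - df) / vk * 100 }'
}
saved_gb() {
  # gb is the dataset size as stored by Valkey, scaled by the per-entry saving.
  awk -v df="$1" -v vk="$2" -v gb="$3" 'BEGIN { printf "%.1f\n", gb * (vk - df) / vk }'
}

echo "saving: $(saving_pct 127 149)%"
echo "on 60 GB: $(saved_gb 127 149 60) GB"
```

This prints 14.8% and 8.9 GB, which the text rounds to 15% and 9 GB.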

Memory efficiency is the kind of number that looks small in a table and shows up large on the bill. If you are running a 100 GB cache, a 15% reduction is the difference between one instance class and another.

On tail latency

The throughput numbers get the headlines, but P99 is the number that determines whether you sleep at night. A few observations from the histograms:

On m7g.4xlarge writes, Valkey's distribution piles up in the 600–800 µs range with a long tail extending past 4 ms. Dragonfly's writes cluster much tighter, with the bulk of the distribution in the 300–450 µs range and a short tail.

This is consistent across instance sizes. Dragonfly's P99 is not just lower than Valkey's; the entire shape of the distribution is tighter. For workloads with a P99 SLO (and most production caches have one, whether or not it is written down), this matters more than the average or the median.

The mechanism is the same one that explains the write throughput gap: a single-threaded execution path serializes every command, which means that any command that takes slightly longer pushes everything behind it. A sharded, shared-nothing architecture spreads that risk across cores.
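
A toy model makes the head-of-line effect visible. The sketch below is an assumption for intuition only, not how either engine is implemented: a burst of commands is queued at once, the first one is slow, and commands are assigned to shards round-robin. delayed is a hypothetical helper name.

```shell
#!/bin/sh
# Toy model (an assumption, not how either engine is implemented):
# n commands of cost 1 are queued at once; the first command costs `slow`.
# Commands are assigned to shards round-robin, and each shard runs its
# own queue in order. A command is "delayed" if the slow command sits
# ahead of it in the same queue.
delayed() {
  awk -v n="$1" -v shards="$2" -v slow="$3" 'BEGIN {
    count = 0
    for (i = 0; i < n; i++) {
      s = i % shards                                # shard that owns command i
      cost = (i == 0) ? slow : 1
      if (busy[s] + 0 > queued[s] + 0) count++      # slow command is ahead in this queue
      busy[s] += cost                               # total work queued on shard s
      queued[s] += 1                                # number of commands queued on shard s
    }
    print count
  }'
}

echo "1 shard:  $(delayed 1000 1 50) of 1000 commands delayed"
echo "8 shards: $(delayed 1000 8 50) of 1000 commands delayed"
```

In this model one slow command delays 999 of 1,000 commands on a single queue and 124 with eight shards. It ignores cross-shard commands and arrival timing, so treat it as intuition rather than a prediction.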

Where Valkey 9.0 wins or closes the gap

On read-heavy workloads on small instances, the throughput gap has narrowed significantly compared to Valkey 8. If your workload is read-dominant, tail-tolerant, and your fleet is small instances, the upgrade from Valkey 8 to 9 will give you a real lift.

Valkey also wins on ecosystem maturity in the obvious ways: more than a decade and a half of operational knowledge inherited from Redis, broad client library coverage, and a deep bench of people who know how to debug it at 3 AM. We are not going to pretend Dragonfly has matched that yet. We have made progress, but it is the kind of progress that takes years.

What Valkey 9.0 has not done is change the underlying execution model. Command execution still runs on a single thread. That is a deliberate design choice with real benefits — atomicity is simple, locking is trivial — and for a long time it was the right trade.

Closing thoughts

We re-ran this benchmark because Valkey 9.0 deserved a fresh look, and the answer is that the new I/O path closes real ground on read throughput at smaller sizes. On vertical scaling, on tail latency, on write throughput, and on memory efficiency, Dragonfly's architecture continues to deliver a different class of performance.

The number we keep coming back to is the c7gn.metal ceiling: 6.6 million QPS, single process, no cluster. That is not a number you reach by tuning a single-threaded engine. It is the result of a shared-nothing, thread-per-core design that scales with the hardware you give it.

If you want to reproduce these results, every command in this post should work on a fresh AWS account. If you run them and see materially different numbers, open an issue on the Dragonfly GitHub and we will investigate publicly. If there is a workload you most want to see compared next, let us know on Discord or the forum.

Appendix: full server configurations

Dragonfly

All instance sizes:

./dragonfly --conn_use_incoming_cpu --dbfilename= --logtostderr --port 6380

For the m7g.2xlarge test we set --maxmemory=28GB. For the m7g.4xlarge test we set --maxmemory=58GB. No other Dragonfly tuning was applied — the engine chooses its own thread count per machine.

Valkey 9.0

docker run --network="host" --rm valkey/valkey:9.0 \
--save "" --appendonly no \
--io-threads 8 \
--protected-mode no --port 6380 \
--maxmemory <size>

maxmemory was set to 28 GB on the m7g.2xlarge and 58 GB on the m7g.4xlarge. We tested with --io-threads set higher than 8 on the 16 vCPU box and observed no additional throughput, consistent with the documented behavior of Valkey's I/O layer.

Client parameters

Test	Connections (-c)	Requests (-n)
m7g.2xlarge writes, Dragonfly	25	2,000,000
m7g.2xlarge writes, Valkey	25	2,000,000
m7g.2xlarge reads, Dragonfly	30	200,000
m7g.2xlarge reads, Valkey	30	200,000
m7g.4xlarge writes, Dragonfly	30	3,000,000
m7g.4xlarge writes, Valkey	30	3,000,000
m7g.4xlarge reads, Dragonfly	40	200,000
m7g.4xlarge reads, Valkey	40	200,000
c7gn.metal writes, Dragonfly	30	3,000,000
c7gn.metal reads, Dragonfly	30	200,000

--key_maximum was sized per instance to populate the working set: 200M for the 8 vCPU runs, 400M for the 16 vCPU and metal runs.
