The Hidden Bottlenecks of Scaling Out: Network, CPU, and Memory
Scaling out your database with many small instances creates network and other bottlenecks. Learn how modern architecture can leverage large machines for true performance.

When teams think about scaling heavy workloads, they often focus on how many nodes (or pods in a Kubernetes environment) they’re running. But for stateful applications (i.e., databases, caching layers), it is much more important to also consider the balance of hardware resources such as CPU, memory, disk bandwidth, and network bandwidth. For instance, you can have plenty of compute headroom and still experience system slowdowns due to network limitations that prevent data from moving quickly enough.
Most teams don’t notice this until it’s already causing problems in production: we are talking about a caching layer handling millions of RPS, which essentially becomes part of the critical path in the overall system. At that point, it’s not unusual to see latency spikes, timeouts, or nodes hitting limits that no amount of tuning seems to fix, leading to degraded performance or even downtime. And more often than not, the culprit is the same: stateful data systems require more careful resource planning than stateless applications. In this blog, we’ll learn about the impact hardware constraints have on heavy in-memory workloads and compare how Redis, Valkey, and Dragonfly behave under such conditions.
Bigger Instances Get Better Treatment
Let’s take a look at the AWS ElastiCache cache.r7g memory-optimized instance family. At first glance, it seems that all instances have hardware resources allocated in proportion to their number of vCPUs:
- They all have ~6.6 GiB memory per vCPU.
- They all have ~0.47 Gbps baseline network bandwidth per vCPU.
Smaller instances can even burst to higher network bandwidth, although this is not guaranteed.
| Instance Type | Baseline Bandwidth (Gbps) | Burst Bandwidth (Gbps) | vCPUs | Memory (GiB) |
|---|---|---|---|---|
| cache.r7g.large | 0.937 | 12.5 | 2 | 13.07 |
| cache.r7g.xlarge | 1.876 | 12.5 | 4 | 26.32 |
| cache.r7g.2xlarge | 3.75 | 15 | 8 | 52.82 |
| cache.r7g.4xlarge | 7.5 | 15 | 16 | 105.81 |
| cache.r7g.8xlarge | 15 | - | 32 | 209.55 |
| cache.r7g.12xlarge | 22.5 | - | 48 | 317.77 |
| cache.r7g.16xlarge | 30 | - | 64 | 419.09 |
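If you want to verify the proportionality claim yourself, a few lines of Python over the published numbers is enough (the values below are simply copied from the table above; no API calls involved):

```python
# Sanity check: compute the per-vCPU ratios from the cache.r7g table above.
# Tuples are (instance type, baseline Gbps, vCPUs, memory GiB).
instances = [
    ("cache.r7g.large",    0.937,  2,  13.07),
    ("cache.r7g.xlarge",   1.876,  4,  26.32),
    ("cache.r7g.2xlarge",  3.75,   8,  52.82),
    ("cache.r7g.4xlarge",  7.5,   16, 105.81),
    ("cache.r7g.8xlarge",  15.0,  32, 209.55),
    ("cache.r7g.12xlarge", 22.5,  48, 317.77),
    ("cache.r7g.16xlarge", 30.0,  64, 419.09),
]

for name, gbps, vcpus, mem in instances:
    print(f"{name:<20} {mem / vcpus:.2f} GiB/vCPU  {gbps / vcpus:.3f} Gbps/vCPU")
# Every row lands at roughly 6.6 GiB and ~0.47 Gbps per vCPU.
```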
However, in reality, cloud providers cannot allocate resources equally across all instance types due to the inherent constraints of virtualization. This is true not only for CPU cycles but also for disk or network bandwidth. Allocation is fundamentally tied to the size and tier of the instance. Larger instances are provisioned with more guaranteed, high-performance resources, while smaller instances often share contended pools with “best-effort” access.
Taking the network bandwidth allocation as an example:
- Large instances get higher, predictable bandwidth and often their own dedicated network paths. They can sustain heavy traffic for long periods without hitting their limits.
- Small instances, on the other hand, live in a noisier neighborhood. Their bandwidth is lower, often burstable, and much more sensitive to other tenants using the same shared network.
ElastiCache Network Bandwidth Limitation Examples
Consider the following use case from this AWS blog post:
…an application running GET commands that return 25 KB values can expect a maximum RPS of 3,750 on a m6g.large instance type due to its 0.75 Gbps bandwidth limit. If the application sends a higher rate of commands, throttling will take place and packet delays or loss will eventually occur…
This creates a clear bottleneck: while this specific instance is small, its CPU and memory could potentially support a greater Redis workload. However, the network constraint imposes a surprisingly low cap for an in-memory data store, and any attempt to push beyond this limit results in throttling and packet loss.
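The arithmetic behind that ceiling is worth internalizing, because you can run it for your own value sizes before choosing an instance. A minimal sketch, treating 25 KB as 25,000 bytes and ignoring protocol overhead:

```python
# Network-bound RPS ceiling for the AWS example above: a m6g.large
# (0.75 Gbps limit) serving 25 KB GET responses. RESP protocol overhead,
# packet headers, and inbound request traffic are ignored for simplicity.

bandwidth_gbps = 0.75            # instance network limit
value_size_bytes = 25 * 1000     # 25 KB response payload

bytes_per_second = bandwidth_gbps * 1e9 / 8
max_rps = bytes_per_second / value_size_bytes
print(f"Network-bound ceiling: ~{max_rps:,.0f} RPS")   # ~3,750 RPS
```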
Another common pitfall involves misunderstanding the “burstable” network performance of smaller instance types. A user reported on StackExchange that their Amazon ElastiCache deployment, built on cache.t4g.small instances, was throwing Bandwidth Allowance Exceeded errors despite showing low average network usage. The confusion arose because the instance’s “up to 5 Gbps” network capability is not guaranteed. It can burst to a higher bandwidth for short periods when it has accumulated credits from periods of low activity. However, these credits deplete quickly under sustained load, and once exhausted, throughput drops to a much lower baseline rate.
As shown above, this behavior can create a deceptive performance profile. An instance might handle traffic spikes briefly but will throttle severely during prolonged operations, making it unsuitable for workloads with consistent demand.
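AWS doesn’t publish the exact credit mechanics, but a toy bucket model makes the failure mode easy to see. Every number below (baseline, burst ceiling, credit pool, demand) is purely illustrative and not an actual cache.t4g.small specification:

```python
# Toy model of burstable networking: a credit bucket that refills at the
# baseline rate and drains whenever delivered bandwidth exceeds it.
# All numbers are illustrative, NOT real cache.t4g.small specifications.

baseline_gbps = 0.5      # hypothetical guaranteed baseline
burst_gbps = 5.0         # hypothetical "up to" burst ceiling
credits_gb = 300.0       # hypothetical bucket of burstable gigabits
demand_gbps = 1.5        # sustained application demand

for minute in range(1, 11):
    ceiling = burst_gbps if credits_gb > 0 else baseline_gbps
    delivered = min(demand_gbps, ceiling)
    # Refill at the baseline rate, drain by whatever was delivered above it.
    credits_gb = max(0.0, credits_gb + (baseline_gbps - delivered) * 60)
    print(f"minute {minute:2d}: {delivered:.1f} Gbps delivered, {credits_gb:5.0f} Gb of credits left")
# The workload looks healthy for the first few minutes, then throughput
# collapses to the baseline once the credits are exhausted.
```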
The logical solution, as often recommended in such discussions, is to migrate the workload to a larger instance type. This directly addresses the fundamental constraint by providing a higher, guaranteed baseline bandwidth rather than a temporary burst allowance. AWS also advises this scaling principle for teams tackling network limits:
It is important to note that every byte written to the primary node will be replicated to N replicas, N being the number of replicas. Clusters with small node types, multiple replicas, and intensive write requests may not be able to cope with the replication backlog. For such cases, it’s a best practice to scale-up (change node type), scale-out (add shards in cluster-mode enabled clusters), reduce the number of replicas, or minimize the number of writes.
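To see how quickly replication eats into a small node’s bandwidth, here is a rough estimate; the write rate, value size, and replica count are hypothetical, and the 0.937 Gbps baseline is taken from the cache.r7g.large row in the table earlier:

```python
# Rough outbound-bandwidth estimate for a primary node with N replicas.
# Write rate, value size, and replica count are hypothetical examples.

write_rps = 20_000           # hypothetical write commands per second
value_size_bytes = 1_000     # hypothetical average value size
replicas = 2                 # each replica receives the full write stream

write_gbps = write_rps * value_size_bytes * 8 / 1e9
replication_gbps = write_gbps * replicas
baseline_gbps = 0.937        # cache.r7g.large baseline from the table above

print(f"Write stream:       {write_gbps:.2f} Gbps")
print(f"Replication egress: {replication_gbps:.2f} Gbps "
      f"({replication_gbps / baseline_gbps:.0%} of the node's baseline)")
```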
Our customers who migrated to Dragonfly from Redis or Valkey report encountering this exact pattern countless times. We fully agree that for scaling high-throughput, real-time data layers, moving to a larger instance (scale-up) or adding more shards (scale-out) is inevitable. However, this presents a critical architectural decision: which path is optimal for your specific workload, and what are the hidden trade-offs? The answer defines your system’s future performance, cost, and operational complexity. Let’s explore this in more detail in the next sections.
Redis & Valkey Force You Into Small Instances
Redis and Valkey are single-threaded for command execution, and this architecture presents a fundamental scaling dilemma. Even on a large machine with dozens or even hundreds of CPUs, a single instance can utilize only one core for its primary operations (with, optionally, a few more threads for network I/O or background tasks). To achieve greater throughput, users are forced to scale out, deploying in cluster mode with shards distributed across multiple servers. This naturally leads to clusters composed of many smaller nodes rather than a few large ones, which directly impacts resource efficiency and network performance.
This forced scale-out strategy collides with the resource allocation model of cloud infrastructure, creating a significant constraint:
- Lower Guaranteed Resources: Small instances receive not only lower guaranteed network bandwidth but also proportionally smaller allocations of other resources like CPU credits and I/O throughput. They simply do not get the same resource tier as larger instances.
- Dependence on Burst Network Capacity: To handle traffic, these small nodes often rely on burstable network and CPU performance. While suitable for short spikes, this is unsustainable for the consistent, high load typical of a data layer, leading to throttling.
- Increased Internal Traffic: A large cluster of many small nodes generates substantial east-west traffic for coordination, gossip, and client redirection. This chatter consumes a portion of the already limited network bandwidth, further saturating the available pipe.
- The Noisy Neighbor Problem: Small instances share physical hardware (network, I/O) with many other cloud tenants. If a neighboring workload on the same host becomes active, your instance can experience sudden resource contention, leading to unpredictable P99 latency spikes and inconsistent throughput—a problem inherent to being tied to many small, shared nodes.
Thus, users face a difficult trade-off: scaling out is architecturally necessary for Redis/Valkey to increase performance, but it forces deployments into the smallest, most resource-constrained instance tiers. The result is that Redis/Valkey clusters are structurally more exposed to cloud resource limits, with CPU, memory, and network bandwidth becoming bottlenecks in different ways depending on the workload.
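To put the trade-off in numbers, compare the same 64 vCPUs purchased as 32 small nodes versus a single large one, using the baseline bandwidth figures from the cache.r7g table earlier (a simplified comparison that ignores replicas and cluster coordination traffic):

```python
# Same total vCPUs, very different per-node network ceilings.
# Baseline figures are taken from the cache.r7g table earlier; replicas
# and cluster coordination traffic are ignored for simplicity.

small = {"name": "cache.r7g.large", "vcpus": 2, "baseline_gbps": 0.937}
large = {"name": "cache.r7g.16xlarge", "vcpus": 64, "baseline_gbps": 30.0}

n_small = large["vcpus"] // small["vcpus"]   # 32 small nodes
aggregate = n_small * small["baseline_gbps"]

print(f"{n_small} x {small['name']}: {aggregate:.1f} Gbps in aggregate, "
      f"but any single shard is capped at {small['baseline_gbps']} Gbps")
print(f" 1 x {large['name']}: {large['baseline_gbps']:.1f} Gbps available "
      f"to every key the node serves")
```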
Dragonfly’s Modern Architecture Unlocks the Full Potential of Large Instances
Dragonfly’s advantage comes from being able to fully leverage large cloud instances by scaling both vertically and horizontally.
Unlike the inherently single-threaded architecture of Redis and Valkey that forces users into a scale-out model with many small instances, Dragonfly is a modern, multi-threaded system designed to make full use of modern hardware. Its design allows it to scale up first by fully utilizing all the CPU cores and memory of a single, large machine, delivering significantly higher throughput per node. When greater capacity is needed, Dragonfly Swarm, our clustering solution, enables you to scale out by connecting a few large, powerful instances rather than dozens of small ones.
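To give a feel for the difference between a single command-execution thread and a shared-nothing, thread-per-core design, here is a deliberately simplified toy in Python. It is not Dragonfly’s (or Redis’s) actual implementation, just an illustration of the routing idea: each worker owns a private slice of the keyspace, so all cores can execute commands in parallel without locking the data.

```python
# Toy sketch of a shared-nothing, thread-per-core data store (illustrative
# only; this is NOT Dragonfly code). Each worker thread owns a private dict
# shard, and commands are routed to the owning worker by key hash.

import queue
import threading

NUM_WORKERS = 4  # conceptually, one worker per CPU core


class Worker(threading.Thread):
    def __init__(self) -> None:
        super().__init__(daemon=True)
        self.inbox: queue.Queue = queue.Queue()
        self.store: dict = {}  # this shard's private data, touched by one thread only

    def run(self) -> None:
        while True:
            op, key, value, reply = self.inbox.get()
            if op == "SET":
                self.store[key] = value
                reply.put("OK")
            elif op == "GET":
                reply.put(self.store.get(key))


workers = [Worker() for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()


def dispatch(op: str, key: str, value=None):
    """Route a command to the worker that owns the key's hash slot."""
    reply: queue.Queue = queue.Queue()
    workers[hash(key) % NUM_WORKERS].inbox.put((op, key, value, reply))
    return reply.get()


dispatch("SET", "user:42", "alice")
print(dispatch("GET", "user:42"))  # -> alice
```

A single-threaded design is the degenerate case of NUM_WORKERS = 1: no matter how many cores the machine has, one loop executes every command.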
This architectural difference has direct and significant implications for performance and efficiency:
- Higher Guaranteed Resources: Larger instances provide not just more guaranteed network bandwidth, but also stable allocations of vCPUs and other resources. This eliminates dependence on unpredictable burst networking, making Dragonfly ideal for consistent, high-throughput workloads.
- Reduced Cluster Complexity: By running clusters with fewer, larger nodes, Dragonfly Swarm significantly reduces operational complexity. In addition, Dragonfly Swarm doesn’t rely on continuous gossip protocols for cluster state propagation: communication between nodes primarily occurs during topology changes, such as slot migrations. This ensures that the valuable network bandwidth of your large instances is preserved almost entirely for serving application requests.
- Effective Isolation from Noisy Neighbors: Larger instances receive more isolated network and other resources and are treated as higher-priority residents by cloud providers. By running on them, Dragonfly effectively avoids the variable performance caused by sharing heavily contended hardware with other tenants, resulting in predictable, low-latency performance.
But scaling efficiently isn’t just about handling uniform load. It’s also about managing the inherent imbalance of real-world data access patterns, a challenge where Dragonfly’s architecture again proves decisive.
Hotspots and Traffic Distribution
Distributed systems cannot assume uniform workload distribution across all nodes. Instead, they often serve workloads that follow a Zipfian distribution. This is a common pattern where a small subset of keys is accessed far more frequently than the rest. For example, a handful of popular items might receive 80% of all requests. This power law pattern can have a dramatic effect on cluster deployments.
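You can get a feel for how skewed such a workload is with a few lines of NumPy; the exponent and keyspace size below are arbitrary illustrative choices, not measurements from a real system:

```python
import numpy as np

# Sample key accesses from a Zipf distribution and measure how much of the
# traffic the most popular 1% of keys receives. Parameters are illustrative.
rng = np.random.default_rng(7)
num_keys = 100_000
ranks = rng.zipf(a=1.2, size=1_000_000)
ranks = ranks[ranks <= num_keys]               # keep ranks inside the keyspace

counts = np.bincount(ranks, minlength=num_keys + 1)[1:]
hottest_first = np.sort(counts)[::-1]
top_1_percent_share = hottest_first[: num_keys // 100].sum() / counts.sum()
print(f"Top 1% of keys receive {top_1_percent_share:.0%} of all requests")
```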
Hotspots Are Amplified by Many Small Nodes
When the architecture forces you to distribute a Zipfian workload across many small instances, say, 100 shards, this natural imbalance is magnified. The “hottest” shard, which holds the most popular keys, bears a massively disproportionate load. Statistically, in such a deployment, the highest-loaded node could experience traffic 4.08x greater than the lowest-loaded node. This creates a severe hotspot: the hottest shard needs roughly 4x the CPU and network bandwidth of the coldest, and potentially 4x the memory as well. In this statistically likely scenario, a single small instance becomes the bottleneck for the entire cluster, leading to throttling, latency spikes, and potential failure.

Dragonfly Architectural Advantage: Stability Through Consolidation
Dragonfly’s ability to scale up and consolidate workloads onto fewer, much larger nodes fundamentally mitigates this risk. When the same Zipfian workload is served by a cluster of 10 powerful instances, the statistical load distribution becomes far more equitable. The extreme peaks and valleys smooth out, with the load ratio between the hottest and coldest nodes dropping dramatically to just 1.27x.

This stability arises because consolidating data onto fewer nodes inherently blends the popular keys with a larger pool of less-active keys. A single large node has the resource headroom (in CPU, network, and memory bandwidth) to absorb the spike from hot keys without becoming saturated. The result is a more predictable, resilient, and fully utilized cluster where resources are not wasted on idle nodes while others drown in traffic.
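If you want to explore this effect yourself, the sketch below spreads a Zipf-distributed keyspace across a configurable number of shards and compares the expected load of the hottest and coldest shard. The exponent, keyspace size, and random slot assignment are illustrative choices, so the printed ratios will not reproduce the exact 4.08x and 1.27x figures above, but the trend is the same: more, smaller shards means worse imbalance.

```python
import numpy as np

def hottest_to_coldest(num_shards: int, num_keys: int = 1_000_000,
                       s: float = 0.99, seed: int = 3) -> float:
    """Expected load ratio between the hottest and coldest shard when a
    Zipf(s)-distributed keyspace is spread pseudo-randomly across shards."""
    ranks = np.arange(1, num_keys + 1, dtype=np.float64)
    weights = ranks ** -s
    weights /= weights.sum()                   # per-key access probability
    rng = np.random.default_rng(seed)
    shard_of_key = rng.integers(0, num_shards, size=num_keys)
    shard_load = np.bincount(shard_of_key, weights=weights, minlength=num_shards)
    return shard_load.max() / shard_load.min()

for shards in (100, 10):
    print(f"{shards:>3} shards: hottest/coldest load ratio ~ {hottest_to_coldest(shards):.2f}x")
```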
Choose a Data Store That Lets You Leverage Modern Hardware
Most in-memory data stores were never designed to take advantage of the larger, higher-bandwidth instances the cloud makes available today. Redis and Valkey spread workloads across many small instances because of their single-threaded architecture. That made sense a decade ago, but today it locks them into the lowest tiers with the highest exposure to noisy neighbors. Their architecture forces them to leave the biggest, most powerful machines on the table.
Dragonfly was built to break that bottleneck. By fully utilizing multi-core hardware, it can run on large instances that come with huge, guaranteed network capabilities. Instead of stitching together dozens of small nodes and hoping the CPU and network stay quiet, you scale vertically first, scale horizontally when necessary, and get predictable performance even under the heaviest load.
If your workload moves a lot of data, the data store that can use the big cores, big RAM, and big bandwidth is the one that wins. Dragonfly can, and that’s why the performance difference is so large in real-world systems!


