We’re Ready for You Now: Dragonfly In-Memory DB Now Supports Replication for High Availability

Dragonfly is a highly performant in-memory database that can act as a drop-in Redis replacement. Version 1.0 of Dragonfly is production ready and includes database replication, making it easy to migrate from Redis and suitable for high-availability deployments.

High availability refers to a system’s ability to operate continuously, without downtime or failure, preferably by using built-in failover mechanisms rather than over-provisioning. It matters to any modern application: from social media to critical financial software, users expect performant applications with no downtime.

Dragonfly’s replication process is fast, keeps memory usage predictable, and supports high throughput. This makes it well suited to high-availability solutions, especially where existing Redis deployments are struggling.

Dragonfly makes migration easy, providing an API that is fully compatible with Redis. This means you can use it with your existing applications and libraries without the need to make any code changes.

In this article, we explain what database replication is and how you can leverage it to improve availability. We also explain how the replication process works in Dragonfly and how it differs from replication in Redis. Finally, we cover how to configure and manage replication in Dragonfly.

What is database replication?

Database replication is the process of continuously copying the contents of a primary database to replica databases, which usually run on different servers. Keeping multiple copies of the same database allows your system to avoid a single point of failure: if your primary (master) node fails, a replica can be automatically promoted to be the new master node, becoming the new destination for data writes and the replication source for any remaining replicas. This allows systems to recover from failure without data loss, decreased performance, or downtime.

How does replication help with high availability?

Database replication is one of the cornerstones of any highly available in-memory database system. Without replication in place, if your database crashes, you will have to manually restore from a backup. This is time-consuming and, even worse, you will have lost any data written to your database since your last backup. With a reliable database replication implementation, including a monitoring system to detect instability or failures, your application can deal with a failing primary database node by automatically falling back to a replica node and continuing to operate as normal.

How Dragonfly replication works

For developers, replication in Dragonfly works the same as Redis replication and implements the exact same API for compatibility and ease of learning. However, the underlying code is all Dragonfly. One of the key differences is Dragonfly’s snapshotting algorithm, which makes replication reliable without drastically affecting performance.

Dragonfly uses a shared-nothing architecture that allows data to be replicated in parallel over different connections, with one connection per thread. This is possible because Dragonfly was designed to be multi-threaded, which is also what makes it so fast. Stored data is spread across shards within Dragonfly, each holding a different set of keys. Each thread is responsible for a specific shard and performs the operations on that shard, including replicating its data.
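To make the idea of shards holding disjoint sets of keys concrete, here is a minimal Python sketch. It is an illustration only: the CRC32 hash and the function names are assumptions chosen for the example, not Dragonfly’s actual hashing scheme or code.

```python
import zlib


def shard_for_key(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    CRC32 is an assumption for illustration; the hash Dragonfly
    really uses is not described in this article.
    """
    return zlib.crc32(key.encode("utf-8")) % num_shards


def partition(keys, num_shards):
    """Group keys by shard so each shard owns a disjoint set of keys."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for_key(key, num_shards)].append(key)
    return shards


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(10)]
    for shard_id, owned in enumerate(partition(keys, 4)):
        # In a shared-nothing design, one thread would own each of these
        # lists and replicate it over its own connection.
        print(shard_id, owned)
```

Because every key maps to exactly one shard, each thread can replicate its own shard without coordinating with the others.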

Replication begins with a handshake phase, in which connections are opened between the replica and each shard thread on the master. These connections are later used to send the data to the replica.

The replica establishes connections with the master node and exchanges metadata details.

Once these connections are established, the “full synchronization” phase begins, in which the whole dataset is copied from master to replica. Any updates made to the master database while the snapshot is being copied are sent to the replica at the same time. The snapshot and the updates can proceed in parallel thanks to Dragonfly’s snapshotting algorithm, which uses fibers.

Updates are pushed to the replica along with the full snapshot, which saves significant amounts of memory and makes the process more stable.

When each shard finishes its part of the full sync, it changes its status to “streaming state,” which means it’s ready to begin its final, “stable synchronization” phase of sending updates to the replica. Once all shards are in this state, the “stable synchronization” phase begins.

The stable synchronization phase is the standard operating phase of Dragonfly replication. Every time there is an update to the primary database, the change gets streamed asynchronously to the replicas.
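The phase transitions described above can be sketched as a small state tracker. This is a simplified model written for this article’s description, with hypothetical class and method names; it is not how Dragonfly implements replication internally.

```python
from enum import Enum


class ShardState(Enum):
    HANDSHAKE = "handshake"
    FULL_SYNC = "full_sync"
    STREAMING = "streaming"


class ReplicationTracker:
    """Toy model of per-shard replication progress (hypothetical names)."""

    def __init__(self, num_shards: int):
        self.states = [ShardState.HANDSHAKE] * num_shards

    def start_full_sync(self) -> None:
        # After the handshake, every shard starts copying its snapshot.
        self.states = [ShardState.FULL_SYNC for _ in self.states]

    def shard_finished_full_sync(self, shard_id: int) -> None:
        # A shard that has sent its part of the snapshot is ready to stream.
        self.states[shard_id] = ShardState.STREAMING

    def in_stable_sync(self) -> bool:
        # Stable synchronization begins only once all shards are streaming.
        return all(state is ShardState.STREAMING for state in self.states)


if __name__ == "__main__":
    tracker = ReplicationTracker(num_shards=3)
    tracker.start_full_sync()
    for shard_id in range(3):
        tracker.shard_finished_full_sync(shard_id)
        print(shard_id, tracker.in_stable_sync())
```

The point of the sketch is the final check: a single slow shard keeps the whole replica in full synchronization until it catches up.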

How replication in Dragonfly is different from that in Redis

Redis uses a single thread for its replication process, whereas Dragonfly is multi-threaded and can use all of the CPU threads available to it. Because of this, Dragonfly can replicate its shards’ data in parallel, which makes replication much faster.

Dragonfly’s efficient snapshotting algorithm allows it to send update commands to the replica at the same time as the full snapshot. Redis, by contrast, sends the full snapshot first and stores update commands in an in-memory buffer, which is sent to the replica only once snapshotting is complete.

Redis’s replication method can lead to large memory spikes, as shown below. This happens because Redis relies on lazy copy-on-write while the snapshot is taken: each memory page that is written to during the snapshot gets duplicated. As the size of your database increases, even small writes can therefore use a lot of memory. In turn, memory can quickly reach 100% capacity, leading to large latency spikes.

It’s common for Redis’s memory usage to double during replication. This can happen in large spikes, making it difficult to predict memory usage. Therefore, if you’re using Redis, you will need to over-provision its servers to avoid the possibility of your system running out of memory and crashing. In Dragonfly, the memory overhead of replication is constant and not affected by the dataset size. This makes it suitable for situations that require high availability but where it is not appropriate to have excess capacity continuously online to handle unexpected traffic spikes.

One final issue with Redis replication is that the buffer used to store updates while the snapshot is being copied to the replica is finite in size: the higher your workload, the more likely it is that this buffer will fill completely. If this happens, the replication process restarts from the beginning. Under the same workload, exactly the same problem happens again: the buffer reaches capacity and replication restarts, causing an endless replication loop. Dragonfly’s snapshotting algorithm has no need for a replication buffer; it sends updates to the replica right away. This makes the replication process very stable, and the server can handle a high workload while replicating the data.
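The restart loop can be illustrated with a deliberately simple model. The function below is a toy sketch under stated assumptions (a constant write rate, a fixed snapshot duration, and hypothetical parameter names); it does not reproduce Redis’s actual buffer accounting.

```python
def buffered_full_sync(snapshot_seconds, write_bytes_per_sec,
                       buffer_limit_bytes, max_attempts=5):
    """Return the attempt on which a buffered full sync succeeds, or None.

    Toy model: updates that arrive during the snapshot accumulate in a
    finite buffer. If they exceed the limit, the sync restarts.
    """
    for attempt in range(1, max_attempts + 1):
        buffered = snapshot_seconds * write_bytes_per_sec
        if buffered <= buffer_limit_bytes:
            return attempt
        # Buffer overflowed: the sync restarts under the same workload,
        # so the next attempt faces exactly the same conditions.
    return None


if __name__ == "__main__":
    limit = 256 * 1024 * 1024  # hypothetical 256 MiB buffer
    print(buffered_full_sync(60, 1_000_000, limit))
    print(buffered_full_sync(600, 1_000_000, limit))
```

With an unchanged workload, an attempt that overflows the buffer once overflows it every time, which is the endless loop described above. Streaming updates alongside the snapshot removes the buffer from the picture entirely.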

A comparison of Redis and Dragonfly — benchmark tests

We performed some tests on the performance of Redis and Dragonfly during replication. These tests were run on single instances of Redis and Dragonfly hosted on an AWS c5n.9xlarge instance with 36 virtual CPUs and 96 GiB of RAM. We used Redis Labs’s memtier_benchmark tool to perform our tests.

The results of these tests show that Dragonfly’s throughput was 7.6 times higher than that of Redis, its average latency was 7.6 times lower, and its tail (P99) latency was 3.3 times lower. Also, the “full sync” phase of replication was 5.5 times faster in Dragonfly than in Redis, and there were no noticeable memory spikes for Dragonfly, unlike Redis.

Dragonfly’s throughput while replication was running was 1,205,511.27 operations per second, while Redis’s was 159,222.92 operations per second.

The average latency while replication was running was 0.44763 ms for Dragonfly and 3.39067 ms for Redis, showing that it’s possible to have high throughput and low latency at the same time.

The P99 latency while replication was running was 2.399 ms for Dragonfly and 7.839 ms for Redis. Tail latency is low in Dragonfly because its replication process doesn’t cause memory spikes.

Dragonfly’s “full sync” replication phase was 5.5 times faster than that of Redis, and while Redis suffered from large memory spikes during replication, Dragonfly had no noticeable memory spikes.
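The throughput and latency ratios quoted above follow directly from the raw figures. The short Python check below uses only the numbers reported in this section; the dictionary and variable names are ours.

```python
# Figures reported above, measured while replication was running.
dragonfly = {"ops_per_sec": 1_205_511.27, "avg_latency_ms": 0.44763, "p99_latency_ms": 2.399}
redis = {"ops_per_sec": 159_222.92, "avg_latency_ms": 3.39067, "p99_latency_ms": 7.839}

# Higher is better for throughput, so divide Dragonfly by Redis.
throughput_ratio = dragonfly["ops_per_sec"] / redis["ops_per_sec"]

# Lower is better for latency, so divide Redis by Dragonfly.
avg_latency_ratio = redis["avg_latency_ms"] / dragonfly["avg_latency_ms"]
p99_latency_ratio = redis["p99_latency_ms"] / dragonfly["p99_latency_ms"]

if __name__ == "__main__":
    print(f"throughput:  {throughput_ratio:.1f}x higher")
    print(f"avg latency: {avg_latency_ratio:.1f}x lower")
    print(f"P99 latency: {p99_latency_ratio:.1f}x lower")
```

Rounded to one decimal place, these come out to the 7.6, 7.6, and 3.3 figures given earlier.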

How to configure and manage Dragonfly replication for high availability

Our replication documentation gives the full details of how to manage Dragonfly replication and use replication to migrate from Redis to Dragonfly. The Dragonfly replication management API is fully compatible with the Redis API and consists of two user-facing commands: ROLE and REPLICAOF (SLAVEOF).
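As a rough illustration, the two commands can be wrapped in a few Python helpers. This is a sketch, not official client code: the helper names are hypothetical, `client` is assumed to be any object exposing an `execute_command` method (as redis-py’s `redis.Redis` does), and `REPLICAOF NO ONE` is the Redis form for promoting a replica, assumed here to carry over given the stated API compatibility.

```python
def make_replica(client, master_host: str, master_port: int):
    """Tell the server behind `client` to replicate from the given master."""
    return client.execute_command("REPLICAOF", master_host, str(master_port))


def promote_to_master(client):
    """Stop replicating and act as a master (Redis-style REPLICAOF NO ONE)."""
    return client.execute_command("REPLICAOF", "NO", "ONE")


def get_role(client):
    """Ask the server whether it is currently a master or a replica."""
    return client.execute_command("ROLE")
```

With redis-py installed, a client created as `redis.Redis(host="localhost", port=6380)` could be passed to these helpers; the host and port are placeholders for your own replica instance.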

Rather than using the API directly, you can use a high-availability monitoring system such as Redis Sentinel to manage failover automatically. Dragonfly is fully compatible with Redis Sentinel, which can detect when a master instance has failed and automatically promote a replica to be the next master node.

Dragonfly is ready to simplify your in-memory database operations

Dragonfly’s 1.0 GA release now includes full support for replication, making it perfect for high-availability deployments. It is fast, memory-reliable, and easy to manage, scaling vertically to support millions of operations per second and terabyte-sized workloads. It does this all on a single instance, so you don’t need to manage a cluster.

As Dragonfly has the same API as Redis, you can use it as a drop-in replacement in your production environments.

You can start using Dragonfly in a Dockerized container in a matter of minutes, and you’ll be able to see how fast it performs right away. We’ve also published some key performance comparisons with Redis here.