Keeping Dragonfly Always-On: High Availability Options Explained
Explore three ways to run Dragonfly with high availability: Redis Sentinel, Kubernetes Operator, and Dragonfly Cloud, ensuring zero downtime.
August 14, 2025

High Availability for Dragonfly in Production
When deploying Dragonfly, high availability (HA) is critical. Since Dragonfly is a foundational component of your infrastructure, downtime can have significant consequences. Dragonfly supports real-time use cases like caching and streaming, so a properly configured HA setup is essential for maintaining application resilience and minimizing disruptions.
In this blog post, we’ll explore three robust approaches to setting up high availability for Dragonfly:
- Redis Sentinel: A battle-tested HA system designed for Redis, which can be used with Dragonfly as well.
- Dragonfly Kubernetes Operator: A cloud-native way to manage HA in containerized environments.
- Dragonfly Cloud: A fully managed solution with built-in redundancy and failover.
By the end of this guide, you’ll understand how to implement each method, their trade-offs, and best practices for keeping your Dragonfly deployment highly available. Let’s get started.
Redis Sentinel: A Classic Solution within the Ecosystem
Understanding Redis Sentinel
Redis Sentinel is a robust, distributed monitoring and failover management system originally designed for Redis that works seamlessly with Dragonfly. Essentially, a Sentinel instance is Redis running in a special mode. It doesn’t handle actual data requests but instead operates as a process that provides critical functions for maintaining high availability:
- Instance Monitoring: Sentinel continuously checks the health of Dragonfly primary (master) and replica instances. It detects failures through heartbeat mechanisms and response timeouts.
- Failure Notification: Sentinel can integrate with alerting systems to notify administrators of degradation or outages.
- Service Discovery: Sentinel acts as an authoritative source of truth for clients needing to locate the current primary instance. This eliminates manual reconfiguration after failover events.
- Automatic Failover: If a primary instance node fails, Sentinel orchestrates the promotion of a replica to primary. It then updates the system topology to maintain service continuity.
Dragonfly maintains full compatibility with Redis Sentinel’s protocol and behavior. This means you can integrate Dragonfly into existing Redis Sentinel deployments with zero to minimal changes.
Designing a Reliable Sentinel Topology
For a production-ready Sentinel deployment with Dragonfly, run a minimum of 3 Sentinel nodes. Yes, that's right: multiple Sentinel instances form a distributed system that provides high availability for the Sentinel deployment itself, preventing it from becoming a single point of failure. An odd number avoids split-brain scenarios and ensures a quorum can be reached (with 3 nodes, at least 2 must agree on failover decisions), and you can run more (5, 7, etc.) if needed. Additional best practices include distributing Sentinels across separate physical servers, avoiding co-locating Sentinel and Dragonfly instances, and maintaining network proximity between them.

High Availability | Redis Sentinel
Setting Up Dragonfly with Sentinel
Let’s see an example in action. First, let’s run 3 Dragonfly instances with 1 primary and 2 replicas. To make the process more reproducible, we can use Docker Compose. Also note that the complete configuration files can be found in our GitHub repository of examples.
services:
  dragonfly-0:
    container_name: "dragonfly-0"
    image: "docker.dragonflydb.io/dragonflydb/dragonfly"
    ports:
      - "6379:6379"
    command:
      - "--port=6379"
  dragonfly-1:
    container_name: "dragonfly-1"
    image: "docker.dragonflydb.io/dragonflydb/dragonfly"
    depends_on:
      - "dragonfly-0"
    ports:
      - "6380:6380"
    command:
      - "--port=6380"
      - "--replicaof=dragonfly-0:6379"
  dragonfly-2:
    container_name: "dragonfly-2"
    image: "docker.dragonflydb.io/dragonflydb/dragonfly"
    depends_on:
      - "dragonfly-0"
    ports:
      - "6381:6381"
    command:
      - "--port=6381"
      - "--replicaof=dragonfly-0:6379"
With some details omitted to reduce noise, you can see that we make `dragonfly-0` the primary instance and let `dragonfly-1` and `dragonfly-2` be the replicas:
- `dragonfly-0`, `dragonfly-1`, and `dragonfly-2` run on ports `6379`, `6380`, and `6381`, respectively.
- `dragonfly-1` and `dragonfly-2` depend on `dragonfly-0`, which means they wait for `dragonfly-0` to boot first.
- They are also started with the server flag `--replicaof`, which makes them replicas of `dragonfly-0`.
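At this point nothing is running yet, but if you want to sanity-check replication before adding Sentinel, you can bring up just the Dragonfly services and inspect the primary. This is an optional step and assumes `redis-cli` (or any Redis-compatible client) is installed on the host:
$> docker compose up -d dragonfly-0 dragonfly-1 dragonfly-2
$> redis-cli -p 6379 INFO REPLICATION
#=> role:master
#=> connected_slaves:2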
Next, we can set up Redis instances running in Sentinel mode.
# sentinel.conf
port 8000
# sentinel monitor <primary-name> <host-or-ip> <port> <quorum>
#
# Tells Sentinel to monitor this primary, and to consider it in the
# 'objectively down' state only if at least <quorum> sentinels agree.
#
# Replicas are auto-discovered, so you don't need to specify replicas in
# any way. Sentinel itself will rewrite this configuration file adding
# the replicas using additional configuration options.
#
# Also note that the configuration file is rewritten when a
# replica is promoted to master.
sentinel monitor default-primary dragonfly-0 6379 2
sentinel down-after-milliseconds default-primary 1000
sentinel failover-timeout default-primary 2000
sentinel parallel-syncs default-primary 1
The minimal `sentinel.conf` file above first specifies the port for the Sentinel instance. The most important directive is `sentinel monitor`, which tells Sentinel to monitor a specific Dragonfly primary instance. Notably, Dragonfly replicas and Sentinel peers are auto-discovered, so there is no need to specify them. While running, Sentinel rewrites this configuration file, adding discovered replicas and their status through automatically generated configuration entries. Beyond these core elements, the configuration file includes various tunable parameters that govern Sentinel's behavior for down-detection timing, failover procedures, and other operational thresholds. More details are well documented here.
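For illustration, once the deployment has been running for a while, the rewritten file typically ends up with auto-generated entries along these lines. This is only a sketch: the exact directives vary by Redis version, and the addresses and run ID below are placeholders taken from the example outputs later in this post.
# Auto-generated by Sentinel after discovery (illustrative only).
sentinel known-replica default-primary 172.19.0.3 6380
sentinel known-replica default-primary 172.19.0.4 6381
sentinel known-sentinel default-primary 172.19.0.7 8001 9cbe7cd5915ee11c71b67fff041ab7ec0da012e2
sentinel current-epoch 0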
Similar to the above, we can spin up 3 Sentinel instances to monitor the Dragonfly instances (1 primary and 2 replicas) we started previously:
services:
  # dragonfly-0: ...
  # dragonfly-1: ...
  # dragonfly-2: ...
  sentinel-0:
    container_name: "sentinel-0"
    image: "redis:6.0-alpine"
    depends_on:
      - "dragonfly-0"
      - "dragonfly-1"
      - "dragonfly-2"
    ports:
      - "8000:8000"
    command: "redis-server /etc/sentinel-config/sentinel.conf --sentinel"
    volumes:
      - "./config/sentinel-0:/etc/sentinel-config"
  # sentinel-1: ...
  # sentinel-2: ...
Again, with some details omitted to reduce noise, we are essentially running Redis in Sentinel mode by pointing it at the `sentinel.conf` file and passing the `--sentinel` flag. With everything in place, we can run all the instances together with `docker compose`:
$> pwd
#=> /XXX/dragonfly-examples/high-availability/sentinel
$> tree
#=> .
#=> ├── README.md
#=> ├── config
#=> │ ├── sentinel-0
#=> │ │ └── sentinel.conf
#=> │ ├── sentinel-1
#=> │ │ └── sentinel.conf
#=> │ └── sentinel-2
#=> │ └── sentinel.conf
#=> └── docker-compose.yml
$> docker compose up -d
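Once all the containers are up, any Sentinel can be queried for the topology it has discovered. The snippet below assumes `redis-cli` is available on the host; the returned IP belongs to the Docker network and will differ in your environment:
# Ask a Sentinel which instance is currently the primary.
$> redis-cli -p 8000 SENTINEL get-master-addr-by-name default-primary
#=> 1) "172.19.0.2"
#=> 2) "6379"
# List the discovered replicas and fellow Sentinels.
$> redis-cli -p 8000 SENTINEL replicas default-primary
$> redis-cli -p 8000 SENTINEL sentinels default-primary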
We can verify the deployment by connecting to the Dragonfly or Sentinel instances. It is also interesting to issue the `SHUTDOWN` command to the primary Dragonfly instance and watch the Sentinels perform a failover seamlessly:
# Send commands to the 'dragonfly-0' instance.
dragonfly-0$> INFO REPLICATION
#=> role:master
#=> connected_slaves:2
#=> slave0:ip=172.19.0.3,port=6380,state=online,lag=0
#=> slave1:ip=172.19.0.4,port=6381,state=online,lag=0
#=> master_replid:ac470e87cd393cd2d99426889112feddb0ffacbe
dragonfly-0$> SHUTDOWN
#=> OK
# Sentinel output for automatic failover.
#=> +sdown master default-primary 172.19.0.2 6379
#=> +new-epoch 1
#=> +vote-for-leader 9cbe7cd5915ee11c71b67fff041ab7ec0da012e2 1
#=> +odown master default-primary 172.19.0.2 6379 #quorum 3/2
#=> Next failover delay: I will not start a failover before Thu Aug 14 01:02:03 2025
#=> +config-update-from sentinel 9cbe7cd5915ee11c71b67fff041ab7ec0da012e2 172.19.0.7 8001 @ default-primary 172.19.0.2 6379
#=> +switch-master default-primary 172.19.0.2 6379 172.19.0.3 6380
#=> +slave slave 172.19.0.4:6381 172.19.0.4 6381 @ default-primary 172.19.0.3 6380
#=> +slave slave 172.19.0.2:6379 172.19.0.2 6379 @ default-primary 172.19.0.3 6380
#=> +sdown slave 172.19.0.2:6379 172.19.0.2 6379 @ default-primary 172.19.0.3 6380
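After the failover completes, asking any Sentinel for the primary address reflects the new topology, which is exactly how Sentinel-aware clients discover the promoted instance (again assuming `redis-cli` on the host; the address matches the `+switch-master` line above):
$> redis-cli -p 8000 SENTINEL get-master-addr-by-name default-primary
#=> 1) "172.19.0.3"
#=> 2) "6380"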
Redis Sentinel remains a classic, battle-tested solution for achieving high availability with Dragonfly. Beyond the basic setup, Sentinel offers tunable configurations to optimize failover behavior, downtime detection, and quorum requirements. During failovers, Sentinel employs a sophisticated replica selection process, evaluating factors like disconnection time, replication priority, processed offset, and run ID to promote the most suitable replica. By default, replicas with the lowest `replica-priority` and the most up-to-date data are preferred, ensuring a deterministic and reliable transition. While most setups work well with the defaults, advanced users can fine-tune replica eligibility and other configurations to meet specific requirements.
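As a sketch of such tuning, many of these parameters can be adjusted at runtime through the `SENTINEL SET` command, and a failover can even be triggered manually for testing. The values below are examples only; apply the same change to every Sentinel so they stay consistent:
# Loosen down-detection to 5 seconds on this Sentinel (example value).
$> redis-cli -p 8000 SENTINEL SET default-primary down-after-milliseconds 5000
#=> OK
# Force a failover without waiting for the primary to be marked down.
$> redis-cli -p 8000 SENTINEL FAILOVER default-primary
#=> OK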
Dragonfly Kubernetes Operator: A Cloud-Native Approach
For cloud-native deployments, the Dragonfly Kubernetes Operator provides a streamlined way to manage Dragonfly instances on Kubernetes, leveraging built-in orchestration features like health checks and scaling. While it does not employ the same sophisticated replica selection process as Sentinel, the operator is still capable of performing failovers and provides plenty of cloud-native features that integrate well with Kubernetes workflows. More details can be found in our documentation and our announcement blog post. Let's take a look at an example focused on high availability.
# dragonfly-sample.yaml
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  labels:
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/instance: dragonfly-sample
    app.kubernetes.io/part-of: dragonfly-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: dragonfly-operator
  name: dragonfly-sample
spec:
  replicas: 3
  resources:
    requests:
      cpu: 8
      memory: 200Gi
    limits:
      cpu: 8
      memory: 300Gi
Once the Dragonfly Kubernetes Operator is installed, the YAML file above can be used to manage Dragonfly within your Kubernetes cluster:
$> kubectl apply -f dragonfly-sample.yaml
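After applying the manifest, you can watch the operator create the resources. The commands below are a rough sketch; the pod names assume the operator's default StatefulSet-style naming (dragonfly-sample-0, -1, -2), so adjust to what you actually see in your cluster:
# The custom resource reflects the desired and current state.
$> kubectl get dragonfly dragonfly-sample
# Three pods are created: one primary and two replicas.
$> kubectl get pods
#=> dragonfly-sample-0   1/1   Running   ...
#=> dragonfly-sample-1   1/1   Running   ...
#=> dragonfly-sample-2   1/1   Running   ...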
The Dragonfly Kubernetes Operator follows Kubernetes conventions for replica counts while adapting them to stateful data store requirements. The configuration translates to Dragonfly’s high-availability architecture as follows:
- `replicas: 1` → Single Dragonfly primary instance (no HA)
- `replicas: 2` → 1 Dragonfly primary + 1 Dragonfly replica
- `replicas: 3` → 1 Dragonfly primary + 2 Dragonfly replicas
- `replicas: N` → 1 Dragonfly primary + (N-1) Dragonfly replicas

High Availability | Dragonfly Kubernetes Operator
This pattern ensures exactly one primary instance exists at all times, with all additional pods serving as asynchronous replicas. That said, the operator manages high-availability setups, not horizontally scalable sharded clusters. Once Dragonfly is running within Kubernetes, issuing the `SHUTDOWN` command to the primary Dragonfly instance also triggers the operator to perform an automatic failover, similar to what we saw with Sentinel.
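To see this in action, you can run a small failover drill. The sketch below assumes the operator's role labels (role=master / role=replica) and the default pod naming, so confirm both against your operator version; deleting the primary pod simulates a failure, after which the operator promotes one of the replicas:
# Find the current primary pod (assumes the operator's role label).
$> kubectl get pods -l role=master
#=> dragonfly-sample-0   1/1   Running   ...
# Simulate a failure by deleting the primary pod.
$> kubectl delete pod dragonfly-sample-0
# Shortly after, a former replica is labeled as the new primary.
$> kubectl get pods -l role=master
#=> dragonfly-sample-1   1/1   Running   ...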
Dragonfly Cloud: Auto-Piloted High Availability
Dragonfly Cloud offers fully managed Dragonfly data stores, including highly available deployments, handling failover automatically without manual intervention. Whether you’re using a single-shard Dragonfly instance or a Dragonfly Swarm multi-shard distributed cluster, replicas ensure resilience against failures while maintaining performance.
In standalone mode, you can provision up to 2 replicas (3 total instances) distributed across availability zones (AZs) for redundancy, while Swarm mode extends this protection by enabling replicas for each shard in multi-shard distributed clusters. The system automatically synchronizes all replicas with their primary instances. When primary failures occur, Dragonfly Cloud’s continuous monitoring promotes the healthiest replica within seconds and seamlessly redirects clients through the managed endpoint without requiring code changes. All operations, including scaling replicas, upgrading compute resources, or adjusting memory, are performed with zero downtime.

High Availability | Dragonfly Cloud
As shown above, configuring high availability in Dragonfly Cloud couldn’t be simpler. Just navigate to the Durability & High Availability section when creating or editing your data store, click the +Add Replica button, and select your preferred availability zones. You can place replicas in the same zone as your primary for low latency or distribute them across zones for maximum resilience. Once configured, let us handle all replication, failover, and maintenance so that you can focus on your application development.
Final Thoughts
Dragonfly offers multiple robust approaches to high availability, each suited to different deployment needs. Ultimately, your choice depends on your infrastructure and reliability requirements. For maximum control within a familiar ecosystem, Sentinel can be an excellent choice. For Kubernetes-native integration, the operator balances convenience with customization. And if you’d rather offload the complexity entirely, Dragonfly Cloud delivers high availability (and all other features and operations) on autopilot. Whichever path you choose, Dragonfly ensures your in-memory data stays available and performant. So go ahead, scale and failover seamlessly. Happy caching, happy building.