No Sentinels Required: How Dragonfly Cloud Cuts Failover to One Second
Dragonfly Cloud cuts Redis failover from 30+ seconds to 1 second by embedding failure detection directly into the data plane. No Sentinel required.
April 20, 2026
High availability in production caching and data layers isn't optional — it's table stakes. Yet the industry standard, Redis Sentinel, forces teams to build and maintain a parallel monitoring infrastructure just to keep their data plane alive. Dragonfly Cloud takes a fundamentally different approach: embedding failure detection directly into the data plane, eliminating external dependencies, and cutting failover times from 30+ seconds to about one second for software failures.
This post breaks down how Dragonfly Cloud's HA architecture works, why it's faster and simpler than Redis Sentinel, and what this means for teams running latency-sensitive workloads at scale.
The Problem with Redis Sentinel
Redis Sentinel is battle-tested, but its architecture carries inherent tradeoffs that become painful at scale:
You're managing two systems instead of one. Sentinel requires a minimum of three dedicated monitoring processes, typically deployed on separate machines. That's three more processes to provision, configure, secure, patch, and monitor. Each sentinel instance needs its own availability story — if your sentinels go down, so does your ability to fail over.
Failure detection is slow by default. Sentinel's `down-after-milliseconds` defaults to 30 seconds. Even when tuned aggressively, the detection-to-promotion pipeline involves multiple round trips: subjective down → objective down (quorum vote) → leader election → `SLAVEOF NO ONE` → client notification. Real-world failovers routinely exceed 30 seconds.
Clients bear the complexity. Applications must implement Sentinel-aware connection logic: querying sentinels for the current master, subscribing to failover notifications, and reconnecting on promotion. Every client library handles this differently, and edge cases (stale connections, split-brain reads) are yours to debug.
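To make the client-side burden concrete, here is a minimal sketch of the discovery logic a Sentinel-aware client must implement: walk the sentinel list until one answers with the current master address. The `query` callback stands in for a real `SENTINEL get-master-addr-by-name` call; it is not any particular client library's API.

```python
def discover_master(sentinel_addrs, query):
    """Ask each sentinel in turn for the current master.

    `query(addr)` is a stand-in for SENTINEL get-master-addr-by-name:
    it returns a (host, port) tuple or raises ConnectionError if that
    sentinel is unreachable.
    """
    for addr in sentinel_addrs:
        try:
            return query(addr)
        except ConnectionError:
            continue  # this sentinel is down; try the next one
    raise RuntimeError("no sentinel reachable; cannot locate master")
```

And this is only discovery: a production client must also subscribe to failover notifications and tear down stale connections, which is where most of the library-specific edge cases live.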
All failures look the same. Sentinel uses a single TCP probe for every failure mode — a crashed process, a dead VM, and a network partition all produce the same signal: a connection that stopped responding. There's no mechanism to distinguish between them, so Sentinel must wait for the same conservative timeout regardless of what actually went wrong.
Scaling multiplies the pain. Each new instance needs Sentinel coverage. At 50 or 100 shards, the Sentinel topology itself becomes a distributed systems problem: configuration drift, quorum miscalculations, and cascading failover storms.
Dragonfly Cloud's Approach: Embedded, Peer-Driven HA
Dragonfly Cloud eliminates the sentinel layer entirely. Every node in the cluster is both a participant and a monitor, and the control plane reacts to health signals in real time.
Dual-Path Failure Detection
The architecture uses two independent detection mechanisms running in parallel, each optimized for a different failure mode:
Path 1: Process Failure — Instant Local Detection (~1 second)
Every Dragonfly node runs a lightweight agent that pings the local Dragonfly process once per second via the admin socket. The agent also monitors the process state through the OS. When the process exits — whether from a crash, OOM kill, or SIGKILL — the agent detects it on the very next health check: the ping fails, the process manager confirms Dragonfly is no longer running, and the agent immediately reports it as down. One failed ping. One second.
The health report is pushed to the control plane API, which detects the healthy-to-unhealthy transition and fires an instant signal to the shard controller via pub/sub. The controller wakes within a second, evaluates the failure, selects the replica with the highest replication offset, and initiates failover. DNS records update automatically.
For cases where Dragonfly is still running but unresponsive (deadlocked, hung), the agent uses a higher threshold of four consecutive failed pings before declaring it down — distinguishing between a process that's gone and one that's temporarily slow.
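The agent's dual-threshold check can be sketched as a small state machine. This is an illustrative model, not Dragonfly Cloud's actual agent code: the class, method, and status names are assumptions for the example.

```python
class AgentMonitor:
    """Illustrative model of the per-second local health check.

    - Process gone (crash, OOM kill, SIGKILL): one failed ping is
      authoritative, report down immediately.
    - Process running but unresponsive (deadlocked, hung): require
      four consecutive failed pings before declaring it down.
    """

    HUNG_THRESHOLD = 4  # consecutive failed pings for a live process

    def __init__(self):
        self.failed_pings = 0

    def tick(self, ping_ok: bool, process_running: bool) -> str:
        if ping_ok:
            self.failed_pings = 0
            return "healthy"
        self.failed_pings += 1
        if not process_running:
            return "down"      # process is gone: instant detection
        if self.failed_pings >= self.HUNG_THRESHOLD:
            return "down"      # alive but unresponsive long enough
        return "degraded"      # temporarily slow; keep watching
```

The asymmetry is the point: a missing process is an unambiguous signal, so there is no reason to wait, while a slow process earns a few more seconds of patience.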
Path 2: Hardware Failure — Peer ICMP Monitoring (~10 seconds)
When an entire VM dies — kernel panic, cloud provider termination, network isolation — the local agent goes down with it. There's no process to report its own death.
Instead, every node in a shard monitors its peers with ICMP pings at one-second intervals. When a peer becomes unreachable for 10 seconds, the monitoring node reports it as unhealthy. The control plane cross-references this with the failed node's heartbeat status: if the node hasn't reported in for 10+ seconds and at least one peer confirms it's unreachable (with no peers contradicting), the controller declares instance failure.
A signal fires immediately on the healthy-to-unhealthy transition, waking the shard controller. Failover proceeds identically to the process-crash path: select the best replica, promote, update DNS. A replacement node is automatically provisioned.
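The controller's cross-referencing step reduces to a simple decision rule, sketched here as a pure function (an illustrative model with assumed names, not the actual control-plane code):

```python
def instance_failed(heartbeat_age_s: float, peer_reports: dict) -> bool:
    """Decide whether to declare instance failure.

    heartbeat_age_s: seconds since the node's agent last reported in.
    peer_reports: maps peer name -> bool, where True means that peer
    can still reach the node via ICMP.

    Failure requires all three: a stale heartbeat (10+ seconds), at
    least one peer confirming unreachability, and no peer contradicting
    that by still reaching the node.
    """
    heartbeat_stale = heartbeat_age_s >= 10
    any_unreachable = any(not ok for ok in peer_reports.values())
    none_contradict = all(not ok for ok in peer_reports.values())
    return heartbeat_stale and any_unreachable and none_contradict
```

Requiring both a stale heartbeat and uncontradicted peer confirmation is what guards against false positives: a network partition that isolates only one peer produces a contradicting report and blocks the failover.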
Why Two Paths?
Process crashes and hardware failures have fundamentally different observability characteristics:
| | Process Crash | Hardware Failure |
|---|---|---|
| Agent alive? | Yes | No |
| Detection source | Local (authoritative) | Remote (peer consensus) |
| Network available? | Yes | Unknown |
| Detection speed | ~1 second | ~10 seconds |
A single detection mechanism cannot optimally handle both. Sentinel-style TCP probing is an awkward middle ground — too slow for process failures (waiting for timeouts that are irrelevant when the process is simply gone), too blunt for hardware failures (one timeout for all failure modes). By running dedicated, specialized detectors for each failure mode, Dragonfly Cloud eliminates the compromise.
Architectural Comparison
| Capability | Redis Sentinel | Dragonfly Cloud |
|---|---|---|
| External infrastructure | 3+ sentinel processes | None (embedded) |
| Process failure detection | ~30s (configurable) | ~1 second |
| Hardware failure detection | ~30s (same timeout, can't distinguish) | ~10s (dedicated ICMP + peer consensus) |
| Client complexity | Sentinel-aware connections | Standard connection (DNS-based) |
| Failover trigger | Quorum vote → leader election | Signal-driven (instant wake) |
| Master discovery after failover | Client queries Sentinel | Automatic DNS update |
| Operational overhead | High (separate topology) | Zero (fully managed) |
| Scaling behavior | Linear sentinel complexity | No additional components |
How the Pieces Fit Together
Process Crash — The Fast Path
```
T+0s     Dragonfly process crashes (OOM, SIGKILL, bug).
         Agent's next health check finds the process not running.
T+1s     Agent reports Dragonfly down to the control plane API.
         API detects the healthy → unhealthy transition.
         Signal fires to the shard controller via pub/sub.
T+1s     Controller wakes immediately.
         Selects the replica with the highest replication offset.
         Promotes it to master. Updates DNS.
T+1-2s   New master serving traffic.
```
Hardware Failure — The Resilient Path
```
T+0s     VM terminates. Agent and Dragonfly process die instantly.
         Heartbeats stop arriving at the control plane.
T+1s     Peer nodes begin failing ICMP pings (1-second intervals).
T+10s    A peer declares the node unreachable (10-second threshold).
         The peer's agent reports the unhealthy status to the control plane API.
         API detects the healthy → unhealthy transition.
         Signal fires to the shard controller.
T+10s    Controller wakes immediately (no polling delay).
         Evaluates: heartbeat stale (10s) + peer confirms unreachable
         + no contradicting reports = instance failed.
T+11s    Controller selects the replica with the highest replication offset.
         Promotes it to master. Marks the old master as failed.
T+12s    DNS record updated to point to the new master.
         New master begins serving traffic.
T+~2m    Replacement node provisioned and joins as a new replica.
```
Compare this to a typical Redis Sentinel hardware failure timeline:
```
T+0s     VM terminates.
T+30s    Sentinel's down-after-milliseconds expires (default).
         Node marked as subjectively down (SDOWN).
T+30s    Other sentinels confirm → objectively down (ODOWN).
T+31s    Sentinel leader election begins.
T+32s    Leader sentinel issues SLAVEOF NO ONE.
T+33s+   Clients detect the failover via Sentinel subscription.
         Reconnect to the new master.
T+60s+   Typical real-world failover completion.
```
Data Consistency During Failover
Speed without correctness is worse than no failover at all. Dragonfly Cloud's replica selection algorithm maximizes data consistency:
- Prefer replicas still connected to the original master, sorted by replication offset. The replica with the highest offset has the most recent data.
- Fall back to priority-based selection if no replicas are connected to the expected master (e.g., in a network partition scenario).
This is analogous to Redis Sentinel's replica-priority mechanism, but with an important difference: Dragonfly Cloud always has access to real-time replication offset data because the agent continuously reports it. Sentinel only sees the last-known state from INFO output.
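The two-step selection policy can be sketched as follows. The `Replica` fields are assumptions for illustration, not Dragonfly Cloud's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    connected_to_master: bool  # still linked to the original master?
    offset: int                # replication offset (higher = more recent data)
    priority: int              # lower value = preferred fallback choice

def select_replica(replicas):
    """Pick the failover target.

    Prefer replicas still connected to the original master, taking the
    one with the highest replication offset. If none are connected
    (e.g. a network partition), fall back to priority order.
    """
    connected = [r for r in replicas if r.connected_to_master]
    if connected:
        return max(connected, key=lambda r: r.offset)
    return min(replicas, key=lambda r: r.priority)
```

Note how a disconnected replica never wins on offset alone: a stale-but-high offset from before a partition is not trusted over a replica that was still streaming from the master.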
What This Means in Practice
For teams migrating from self-managed Redis with Sentinel:
- Remove sentinel infrastructure entirely. No sentinel processes, no sentinel configuration files, no sentinel monitoring dashboards.
- Simplify client connections. Point your application at a DNS endpoint. Failover is transparent — the DNS record updates, connections re-establish. No Sentinel-aware client libraries required.
- Get faster recovery with zero tuning. ~1-second process failover and ~10-second hardware failover out of the box. No `down-after-milliseconds` to tune, no quorum sizes to debate.
- Scale without scaling your HA topology. Whether you're running 1 shard or 100, the HA mechanism is the same. No additional sentinel capacity planning.
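On the client side, the whole pattern reduces to "retry and re-resolve." Here is a dependency-free sketch with injected `resolve` and `connect` callbacks standing in for DNS lookup and a real Redis client connection; the function name and signature are illustrative:

```python
import time

def call_with_failover(op, resolve, connect, retries=5, backoff_s=0.0):
    """Run op(conn); on connection failure, re-resolve the DNS endpoint
    and reconnect. Failover is transparent to the application because
    the DNS record is updated to point at the newly promoted master.

    resolve() -> current address for the endpoint (fresh lookup).
    connect(addr) -> connection, or raises ConnectionError.
    """
    last_err = None
    for attempt in range(retries):
        try:
            conn = connect(resolve())  # fresh lookup each attempt
            return op(conn)
        except ConnectionError as e:
            last_err = e
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise last_err
```

Most Redis client libraries already ship a retry hook that this slots into; the key difference from Sentinel is that there is nothing to query for the master address — DNS is the source of truth.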
Conclusion
High availability should be a property of the system, not an external add-on. By embedding failure detection into the data plane, using peer-based hardware monitoring, and connecting it all with real-time signaling, Dragonfly Cloud delivers ~1-second process failover and ~10-second hardware failover — with zero additional infrastructure.
The result: your operations team manages fewer moving parts, your applications use simpler connection logic, and your users experience near-imperceptible disruptions when failures inevitably occur.
