Building a Self-Fuzzing CI Pipeline for Dragonfly
March 17, 2026

Dragonfly handles millions of operations per second. That kind of throughput means a single edge case in input handling can cascade fast — a malformed RESP frame, an unexpected argument count, a command sequenced in a way nobody thought to test. This post covers how we went from ad-hoc manual fuzzing to a fully automated CI pipeline where an LLM generates targeted attack vectors for every pull request.
Why Fuzz a Database?
Dragonfly is a modern, high-performance in-memory data store compatible with Redis and Memcached. It’s written in C++ and uses a shared-nothing, multi-threaded architecture. All of that network-facing throughput also means a large attack surface for untrusted input.
Unit tests cover the happy path well. Write a thousand tests for SET and GET and you’ll have solid coverage of the expected cases. But what happens when a client sends XPENDING with one argument instead of two? Or SCRIPT FLAGS with a 9-character SHA instead of 40? Or MONITOR inside MULTI/EXEC twice in a row?
That’s the gap fuzzing fills. A fuzzer is a persistent, systematic tool that feeds unexpected input to your server, tracks code coverage, and learns which inputs reach new code paths. It doesn’t get bored. It doesn’t skip the weird cases. And it finds bugs that no human would think to write a test for.
Where It Started: df-afl
Before fuzzing infrastructure lived inside Dragonfly, there was df-afl — a standalone proof of concept. A collection of Python scripts and a bash wrapper that generated random Redis commands and fed them to Dragonfly through AFL++.
It worked, in the way a duct-taped trebuchet works. It could launch things. Command generation was random: pick a command, attach some arbitrary arguments, hope for a crash. No protocol awareness. No state tracking. No CI integration. You had to build AFL++ manually, configure everything by hand, and then watch a terminal.
The key insight from that experiment was simple: even completely brute-force fuzzing found real bugs. If random garbage crashes a server, focused protocol-aware fuzzing will find far more.
Integrating AFL++ Into the Build System
The first step was bringing AFL++ directly into Dragonfly’s build system via a USE_AFL=ON flag. This wasn’t a trivial compile-flag addition — it required architectural work to make the fuzzer effective.
The persistent mode trick
Traditional fuzzing follows a simple loop: start the program, feed it input, check for a crash, kill it, repeat. For a database server that takes a second to initialize, that loop spends nearly all of its time waiting for startup, which works out to maybe one execution per second. Not useful.
AFL++ has a persistent mode where the target process stays alive and processes multiple inputs in a loop. The implementation runs the Dragonfly server in a background thread while the main thread reads test cases from stdin via __AFL_LOOP, forwards them over TCP, and loops back for the next input. The server handles these as normal client connections. AFL++ treats it as a standard fuzzing target.
This moved throughput from roughly 1 exec/sec to thousands. The fuzzer became faster than the build pipeline.
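The harness loop can be sketched as follows. This is a minimal, compilable sketch, not Dragonfly’s actual harness: `start_server_in_background` and `forward_to_server` are hypothetical stand-ins for the server thread and the TCP client connection, and the `#ifndef` fallback lets the file build without afl-clang-fast (in which case the loop runs once).

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Fallback so the sketch compiles without afl-clang-fast: run the loop once.
#ifndef __AFL_LOOP
static int afl_iters_left = 1;
#define __AFL_LOOP(n) (afl_iters_left-- > 0)
#endif

// Hypothetical stand-ins. In the real harness the server runs in a background
// thread and each test case is forwarded to it over a TCP client connection.
static void start_server_in_background() { /* spawn the server thread */ }
static void forward_to_server(const std::string& /*input*/) { /* write to TCP */ }

// Persistent-mode entry point: AFL++ restarts the loop body, not the process,
// refreshing stdin with a new test case on every iteration.
int run_fuzz_loop() {
  start_server_in_background();
  std::vector<char> buf(1 << 16);
  int handled = 0;
  while (__AFL_LOOP(10000)) {
    size_t n = fread(buf.data(), 1, buf.size(), stdin);
    if (n == 0) continue;  // empty test case: nothing to forward
    forward_to_server(std::string(buf.data(), n));
    ++handled;
  }
  return handled;  // number of non-empty inputs forwarded this cycle
}
```

The startup cost is paid once, outside the loop; every subsequent iteration is just a read and a socket write, which is where the thousandfold throughput gain comes from.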
First crashes
The first fuzzing run found a crash almost immediately — a CHECK failure in dragonfly_connection.cc on malformed RESP input. Then another. And another. The fixes were typically two or three lines. But the bugs lived in edge cases no one would think to write a manual test for.
Protocol-Aware Mutators
Basic byte-level fuzzing has a fundamental problem: most random mutations produce invalid protocol framing. Flip a byte in a RESP message and the parser rejects it before it ever reaches the command handler. You’re testing error handling in the protocol layer, not the actual command logic where interesting bugs live.
The solution was custom AFL++ mutators — one for RESP, one for Memcached. Instead of operating at the byte level, they:
- Parse the input into a list of commands
- Mutate at the command level — swap commands, change arguments, insert new commands, wrap sequences in MULTI/EXEC
- Serialize back to a valid protocol format
The RESP mutator knows about 150+ Redis commands and their arity constraints. It won’t generate a SET with zero arguments, because the parser would drop it instantly. But it will generate a SET whose bulk-string header declares a length of -1, a ZADD with NaN as the score, or an XREADGROUP with an empty consumer name. These are the inputs that get past the parser and into the command handlers where the logic bugs hide.
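As an illustration, the serialize-back step and one command-level mutation look roughly like this. This is a simplified sketch, not the actual mutator; the MULTI/EXEC wrap is one of the strategies listed above.

```cpp
#include <cassert>
#include <string>
#include <vector>

using Command = std::vector<std::string>;

// Serialize a list of commands back into valid RESP arrays of bulk strings,
// so every mutated input still passes the protocol-framing checks.
std::string SerializeResp(const std::vector<Command>& cmds) {
  std::string out;
  for (const Command& cmd : cmds) {
    out += "*" + std::to_string(cmd.size()) + "\r\n";
    for (const std::string& arg : cmd)
      out += "$" + std::to_string(arg.size()) + "\r\n" + arg + "\r\n";
  }
  return out;
}

// One command-level mutation: wrap the whole sequence in MULTI/EXEC so the
// commands execute inside a transaction.
std::vector<Command> WrapInMultiExec(std::vector<Command> cmds) {
  cmds.insert(cmds.begin(), {"MULTI"});
  cmds.push_back({"EXEC"});
  return cmds;
}
```

Because mutations happen on the command list and serialization always emits well-formed framing, every generated input reaches the command dispatch layer instead of dying in the parser.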
CI Integration
Running fuzzing locally is useful for research. The real value comes from automation. Two workflows now handle this continuously.
Nightly campaigns
Every night at 2 AM UTC, a workflow builds Dragonfly with USE_AFL=ON and runs the fuzzer for 30–60 minutes against both RESP and Memcached protocols. If it finds a crash, the team gets notified. The fuzzer covers the night shift.
PR fuzzing with LLM-generated seeds
Every pull request that touches C++ code triggers a 15-minute fuzzing campaign. But instead of running the fuzzer against the general seed corpus, it does something targeted.
When a PR modifies stream_family.cc, there’s no reason to fuzz the entire command set equally. The focus should be on stream commands — XADD, XREAD, XPENDING, XREADGROUP — with inputs that exercise the specific code paths that changed.
A script takes the PR diff, sends it to Claude Haiku along with the existing seed files for context, and gets back JSON containing targeted seed files and a list of focus commands. The mutator then biases roughly 70% of its mutations toward those commands. The PR number for this feature was #6666. That wasn’t planned.
This works better than expected. Given a diff in script_mgr.cc, the model identifies that the change affects SCRIPT FLAGS and generates seeds with various SHA lengths, including invalid ones. The fuzzer then spends 15 minutes hammering exactly those paths.
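The biasing step itself is simple. A sketch, with hypothetical names: the 70% figure comes from the pipeline, everything else here is illustrative.

```cpp
#include <cassert>
#include <random>
#include <string>
#include <vector>

// Pick the next command to mutate toward: roughly 70% of the time draw from
// the LLM-suggested focus list, otherwise from the full command table.
std::string PickCommand(std::mt19937& rng,
                        const std::vector<std::string>& all,
                        const std::vector<std::string>& focus) {
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  const std::vector<std::string>& pool =
      (!focus.empty() && coin(rng) < 0.7) ? focus : all;
  std::uniform_int_distribution<size_t> idx(0, pool.size() - 1);
  return pool[idx(rng)];
}
```

Keeping 30% of mutations on the general command set matters: a diff can have side effects outside the files it touches, so the fuzzer never fully tunnel-visions on the focus list.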
When an API key isn’t available — fork PRs from external contributors — the system falls back to the existing seed corpus. No key, no problem.
A Bug It Actually Caught
During a PR that refactored stream functions, the PR fuzzing action caught that XPENDING declared arity -2 (at least one argument) while the handler unconditionally accessed the second argument. A classic off-by-one that unit tests didn’t catch, because nobody writes a test for “call XPENDING with only the key and no group name.” The fix was two lines. Without the fuzzer running on that PR, the bug would have reached production.
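A minimal reconstruction of the bug pattern, in hypothetical code rather than Dragonfly’s actual handler:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

using Args = std::vector<std::string>;

// Buggy shape: the arity check admits "XPENDING key" alone, but the handler
// did the equivalent of `std::string group = args[1];` unconditionally,
// reading past the end of the argument list when only the key was supplied.
std::optional<std::string> GroupName(const Args& args) {
  if (args.size() < 2) return std::nullopt;  // the two-line fix: guard first
  return args[1];
}
```

The LLM-generated seeds for that PR included short-argument stream commands, which is exactly why the 15-minute window was enough to hit this path.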
Crash Replay
There’s a subtlety with persistent-mode fuzzing: since the server stays alive across iterations, crashes often depend on accumulated state. A crashing input might only reproduce because previous inputs in the same session built up a specific combination of keys, connections, and transactions.
Replaying just the crashing input doesn’t work. You need the entire sequence.
AFL++ has a feature called AFL_PERSISTENT_RECORD that saves all inputs in the current cycle. A replay script reads these files in order and sends them to a fresh Dragonfly instance, reproducing the exact state that led to the crash. A packaging script bundles everything into a tarball that can be handed to another developer: “here, reproduce this on your machine.”
Getting this right took a few iterations. The first version didn’t handle gaps in RECORD file numbering — AFL++ sometimes skips numbers, and the replay would fail on missing files.
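The gap-tolerant ordering boils down to sorting whatever record files actually exist by their numeric suffix, instead of iterating 0..N and failing on a missing file. A simplified sketch over bare file names; the exact `RECORD:` naming shape is an assumption here.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Filter a directory listing down to AFL++ record files and sort them by the
// numeric suffix, tolerating gaps in the numbering.
std::vector<std::string> OrderedRecordFiles(const std::vector<std::string>& names) {
  std::vector<std::string> records;
  for (const std::string& n : names)
    if (n.rfind("RECORD:", 0) == 0)  // keep only RECORD:* entries
      records.push_back(n);
  std::sort(records.begin(), records.end(),
            [](const std::string& a, const std::string& b) {
              // compare by the number after the "RECORD:" prefix
              return std::stol(a.substr(7)) < std::stol(b.substr(7));
            });
  return records;
}
```

The replay script then feeds the files to a fresh server instance in this order, rebuilding the session state one input at a time until the crash reproduces.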
What We Learned
Start with something ugly. df-afl was a hacky mess, but it proved the concept and found real bugs. Waiting for a clean design means never shipping.
Byte-level fuzzing isn’t enough. It finds parser bugs. The logic bugs that cause production incidents only appear when the fuzzer speaks the protocol correctly enough to reach command handlers.
Persistent mode is non-negotiable. Without it, fuzzing a database server is too slow to be useful. With it, you get thousands of executions per second.
Reproducing crashes is harder than finding them. The crash is just the start. Packaging a reliable reproduction case that another developer can run is where the actual work lives.
LLMs and fuzzers are a natural pairing. Not because the model is doing anything magical — it’s reading a diff and making reasonable guesses about which edge cases are relevant. The fuzzer still does the actual work. But the targeting meaningfully improves what gets exercised in a 15-minute window.
If it’s not in CI, it doesn’t exist. A fuzzer that only runs when someone remembers to start it is decoration.
What’s Next
The core pipeline is in place — nightly runs, PR-level checks, crash replay and packaging. The next areas to tackle:
Cluster mode and replication. Current fuzzing targets single-node setups. Dragonfly’s cluster mode and replication have their own complex state machines. Applying the same treatment there is the natural next step.
Hang detection. The fuzzer currently catches crashes. Some bugs don’t crash the server — they make it hang. Detecting and reproducing hangs requires a different approach: timeouts, watchdog threads, and a mechanism to capture server state when it stops responding.
The fuzzing infrastructure is open source in the fuzz/ directory of the Dragonfly repository. If you’re running Redis in production and dealing with the kinds of edge cases that only show up at scale, Dragonfly is worth a look.
