Building RAG Systems with LlamaIndex and Dragonfly

Large language models (LLMs) are powerful, but they have one big limitation: their knowledge is frozen at training time. If you ask them about your company’s latest product release, the model won’t know unless it was included in the training data. That’s where retrieval-augmented generation (RAG) comes in.

RAG is a pattern that combines two steps:

Retrieve: Search an external knowledge base (like your docs, PDFs, or database) for the most relevant information.
Augment + Generate: Pass that information to the LLM so it can generate an answer based on your actual data.

This allows your LLM to stay up-to-date and domain-specific without requiring retraining. Because of this, it’s no surprise that RAG has quickly become one of the most popular ways teams are building practical AI systems today.

But building RAG systems can get complicated and messy. LlamaIndex is an open-source framework that makes this process much easier, along with agentic AI and other LLM-powered workflows. In this tutorial, we’ll see how you can combine LlamaIndex with Dragonfly (as a vector store) to create RAG systems simply and efficiently. But first, let’s have a quick refresher on vector stores.

Why Do Vector Stores Matter?

The way RAG systems work is that every chunk of text you feed into your pipeline is converted into an embedding. We won’t go deep into the math, but you can think of embeddings as high-dimensional vector representations of the text in your documents. All of these embeddings need to be stored somewhere, and that’s the job of a vector store.

Since your embeddings are stored in the vector store, its performance is critical for your RAG system. Choosing a performant and reliable vector store is one of the most important decisions when building a RAG pipeline.

Dragonfly is a drop-in replacement for Redis, built from the ground up for modern workloads. It’s fully Redis compatible but optimized for today’s hardware and concurrency requirements. Here’s why it is an excellent fit as a vector store in RAG systems:

Unmatched Performance: Designed with a multi-threaded architecture, Dragonfly handles millions of GET/SET operations per second on a single server, making real-time in-memory vector search snappy even at scale.
Redis Compatibility: It speaks the Redis protocol, so you can use existing Redis clients and libraries without modification. As we’ll soon see, LlamaIndex’s RedisVectorStore works with it out of the box.
Operational Simplicity: Run it as a single container, and you get high throughput and efficient memory usage right away. Horizontally scale further with the Dragonfly Swarm multi-shard cluster when needed.

In short, Dragonfly gives you the low-latency, high-throughput vector search you need, without the usual tradeoffs in setup complexity or performance compared with legacy solutions.

Now that we’re on the same page about vector stores, let’s see a step-by-step tutorial on how we can create a RAG system using LlamaIndex and Dragonfly.

Creating a RAG System

Setting up the Python Environment

The first thing we’re going to need to do is set up our Python environment locally. To do that, create a virtual environment and activate it:

python -m venv .venv && source .venv/bin/activate

After that we’ll be installing the packages we need (including the packages from LlamaIndex) to build our RAG system:

pip install -U \
  llama-index \
  llama-index-llms-openai \
  llama-index-embeddings-openai \
  llama-index-vector-stores-redis \
  redis \
  openai

Once this is done, we’re ready to start writing some code!

Downloading our Dataset

Create a dragonfly_demo.py file with the following code:

import pathlib
import urllib.request

DATA_URL = (
    "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
)
DATA_DIR = pathlib.Path("data")
DATA_DIR.mkdir(exist_ok=True)
FILE_PATH = DATA_DIR / "paul_graham_essay.txt"
if not FILE_PATH.exists():
    urllib.request.urlretrieve(DATA_URL, str(FILE_PATH))

We are fans of Paul Graham’s essays, so we thought it’d be a nice idea to train an LLM on that data. What the above code does is basically download a text file containing those essays from the internet and save it to your computer in a folder called data, but only if they don’t already exist locally.

Configuring the OpenAI API

Our app needs to talk to an LLM for two things:

To convert the text the user types and the documents into embeddings.
To get a response for the user based on all the relevant info from the prompt and the vector store.

For this, we’ll use OpenAI.

Create an OpenAI account and generate an API key.
Set it as an environment variable so the code can read it:

export OPENAI_API_KEY="sk-...your-key..."

Next, add the following code to the Python file to check if the environment variable has been set or not:

import os
# other dependencies from above

# other code from above
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Please set OPENAI_API_KEY before running this script.")

Loading the Document into LlamaIndex

Next, we need to read our downloaded dataset and turn it into LlamaIndex documents. A document in LlamaIndex is a generic container around a data source (in this case, our essays). Documents can be constructed manually or created automatically via data loaders. Here, we use a data loader:

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(str(DATA_DIR)).load_data()
print(
    "Document ID:", documents[0].id_,
    "Document Filename:", documents[0].metadata.get("file_name")
)

Connecting LlamaIndex to Dragonfly

After we’ve loaded our documents, we need to create a client to connect to our local Dragonfly instance (which we’ll be running locally using Docker). The following code does that:

from redis import Redis
from llama_index.core import StorageContext
from llama_index.vector_stores.redis import RedisVectorStore

redis_client = Redis.from_url("redis://localhost:6379")
vector_store = RedisVectorStore(redis_client=redis_client, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In this code, RedisVectorStore(...) wraps the Redis client we created so that LlamaIndex can store and fetch embeddings. Setting overwrite=True wipes any existing index with the same name for a clean run (we’ll cover what indexes are in the next section). And StorageContext is just LlamaIndex’s way of passing storage backends into the indexing logic.

Building the Index

Now, we turn the text into embeddings and store them in Dragonfly:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

In LlamaIndex, after loading our data, we need to build an index over the documents we loaded. An index is a data structure composed of document objects, designed to enable querying by an LLM. Basically, LlamaIndex will:

Chunk your document into small pieces.
Call OpenAI embeddings on each chunk (using your OPENAI_API_KEY).
Store those vectors in Dragonfly via the Redis connector.
Return the index object, which is now our handle to query our data later.

Asking Questions: Retriever vs. RAG

Add this under the indexing code:

import textwrap

query_engine = index.as_query_engine()   # for full LLM answers (RAG)
retriever = index.as_retriever()         # for raw retrieved chunks only

print("\\n--- Retriever results: 'What did the author learn?' ---")
result_nodes = retriever.retrieve("What did the author learn?")
for node in result_nodes:
    print(node)

print("\\n--- RAG response: 'What did the author learn?' ---")
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

print("Done.")

So, there are two different things happening here, which is why we output two responses.

With as_retriever(), only vector search takes place. Your question is turned into an embedding (via an OpenAI API call). The code then finds the closest chunks using vector similarity. And we output these raw chunks and scores. There is no LLM writing a proper response for the user yet.

With as_query_engine(), RAG takes place, that is, retrieval plus an LLM answer. First, all the same steps as in retrieval occur. Then another API call is made to OpenAI to write a final answer using those chunks. You see a readable answer as you’d expect from an AI chatbot.

Full Script & Trying Out The Code

This is what the final script should look like:

import os
import pathlib
import textwrap
import urllib.request

from redis import Redis
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.vector_stores.redis import RedisVectorStore

DATA_URL = (
    "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
)
DATA_DIR = pathlib.Path("data")
DATA_DIR.mkdir(exist_ok=True)
FILE_PATH = DATA_DIR / "paul_graham_essay.txt"
if not FILE_PATH.exists():
    urllib.request.urlretrieve(DATA_URL, str(FILE_PATH))

if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Please set OPENAI_API_KEY before running this script.")

documents = SimpleDirectoryReader(str(DATA_DIR)).load_data()
print(
    "Document ID:", documents[0].id_,
    "Document Filename:", documents[0].metadata.get("file_name")
)

redis_client = Redis.from_url("redis://localhost:6379")
vector_store = RedisVectorStore(redis_client=redis_client, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

query_engine = index.as_query_engine()
retriever = index.as_retriever()

print("\\n--- Retriever results: 'What did the author learn?' ---")
result_nodes = retriever.retrieve("What did the author learn?")
for node in result_nodes:
    print(node)

print("\\n--- RAG response: 'What did the author learn?' ---")
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

print("Done.")

Let’s start our Dragonfly container locally and then run this script:

docker run -d -p 6379:6379 --name dragonfly docker.dragonflydb.io/dragonflydb/dragonfly

If you’re using this in production, we’d recommend taking a look at Dragonfly Cloud, which is a managed service that makes running and managing production Dragonfly instances super easy. You can sign up for a free trial here.

Now let’s run our script (make sure you’ve activated the Python venv and set the OpenAI API key as shown above):

python dragonfly_demo.py

And this is what your output should look like (parts about downloading and loading the dataset omitted):

--- Retriever results: 'What did the author learn?' ---
2025-09-14 10:45:09,035 - INFO - HTTP Request: POST <https://api.openai.com/v1/embeddings> "HTTP/1.1 200 OK"
2025-09-14 10:45:09,039 - INFO - Querying index llama_index with query *=>[KNN 2 @vector $vector AS vector_distance] RETURN 5 id doc_id text _node_content vector_distance SORTBY vector_distance ASC DIALECT 2 LIMIT 0 2
2025-09-14 10:45:09,044 - INFO - Found 2 results for query with id ['llama_index/vector_36cdcc18-dd4e-4eb7-808c-613ad048ebd9', 'llama_index/vector_1e2c9d7d-986c-42c6-b9c6-ed31568315f1']
Node ID: 36cdcc18-dd4e-4eb7-808c-613ad048ebd9
Text: What I Worked On  February 2021  Before college the two main
things I worked on, outside of school, were writing and programming. I
didn't write essays. I wrote what beginning writers were supposed to
write then, and probably still are: short stories. My stories were
awful. They had hardly any plot, just characters with strong feelings,
which I ...
Score:  0.820

--- RAG response: 'What did the author learn?' ---
2025-09-14 10:45:09,955 - INFO - HTTP Request: POST <https://api.openai.com/v1/embeddings> "HTTP/1.1 200 OK"
2025-09-14 10:45:09,958 - INFO - Querying index llama_index with query *=>[KNN 2 @vector $vector AS vector_distance] RETURN 5 id doc_id text _node_content vector_distance SORTBY vector_distance ASC DIALECT 2 LIMIT 0 2
2025-09-14 10:45:09,960 - INFO - Found 2 results for query with id ['llama_index/vector_36cdcc18-dd4e-4eb7-808c-613ad048ebd9', 'llama_index/vector_1e2c9d7d-986c-42c6-b9c6-ed31568315f1']
2025-09-14 10:45:10,941 - INFO - HTTP Request: POST <https://api.openai.com/v1/chat/completions> "HTTP/1.1 200 OK"
The author learned that programming on microcomputers was a significant improvement over programming
on larger machines like the IBM 1401, as it allowed for immediate feedback and interaction with the
computer.
Done.

As you can see, there is a clear distinction between the retriever results and the RAG response. The retriever provides the exact text passages from the essays, while the RAG response is the LLM’s generated output based on that retrieved data. The following diagram summarizes this process. It’s also worth noting how Dragonfly worked out-of-the-box with the Redis client/connector, and LlamaIndex made the entire integration seamless.

Building RAG Systems with LlamaIndex and Dragonfly Demo

LlamaIndex & Dragonfly: A Perfect Combo for Building RAG Systems

A lot of people assume that building RAG systems is just about picking the best and most powerful LLM. But the choices of the LLM framework and the vector store have an equally important impact on the speed and reliability of your system. LlamaIndex makes it extremely straightforward to wire everything together, and you just need a data store that can keep up when your data and queries grow.

Dragonfly fits in naturally. Since it’s Redis-compatible, it drops right into the ecosystem most AI developers already know. At the same time, its modern architecture gives you the performance and efficiency needed for embedding-heavy workloads like RAG.

If you’re experimenting with RAG or other LLM workflows, LlamaIndex and Dragonfly are both open source and free to try. Combining the two, they keep both the development and operational sides simple while giving you the real-time generative AI experience. In other words: fewer headaches, faster results, and a solid foundation to build on :)