Announcing Dragonfly Search

Introduction

2023 has been a year with remarkable advancements in AI capabilities, and at Dragonfly, we are thrilled to power new use cases with our latest release: Dragonfly Search. This new feature set, debuting in Dragonfly v1.13, is a subset of RediSearch-compatible commands implemented natively in Dragonfly, allowing for both vector search and faceted search use cases in the highly scalable and performant Dragonfly in-memory data store.

In this post, we will guide you through building a simple recommendation system utilizing OpenAI's embeddings in conjunction with Dragonfly's vector search capabilities. Additionally, we'll explore how Dragonfly can serve as a versatile document store, demonstrating its flexibility and efficiency in handling diverse data management tasks.

Dragonfly Search is being released in Beta. We are excited about its development and future potential, but we do not encourage its use in production environments at the time of this writing. Your feedback is immensely valuable to us, and it plays a critical role in shaping and improving Dragonfly Search as we progress towards a more stable version. If you have any feedback, please create a GitHub issue or drop us a line in Discord.

If you want to learn more about Dragonfly Search, please register for our Community Office Hours, where the team will give a technical presentation and take questions.

Fundamentals of Dragonfly Search

Dragonfly Search enables the creation of indexes for selected HASH and JSON values. Entries stored within or associated with an index are often referred to as documents. Each index is constructed based on a specific schema, defining the fields within the indexed values and the way they should be interpreted. Once established, this index facilitates filtering and sorting documents by various properties, much like a traditional database manages conditional queries.

Let's suppose we use Dragonfly to store information about the world's largest cities.

For each city, we store key information including its name, population, and continent. For example:

dragonfly$> HSET city:1 name London population 8.8 continent Europe
dragonfly$> HSET city:2 name Athens population 3.1 continent Europe
dragonfly$> HSET city:3 name Tel-Aviv population 1.3 continent Asia
dragonfly$> HSET city:4 name Hyderabad population 9.8 continent Asia

To build an index, we use the FT.CREATE command. Firstly, we define the index name and the subset of values to index, such as those with keys prefixed with city:. And then, we outline our schema attributes:

The name attribute of type TEXT.
The population attribute as a NUMERIC type with sorting enabled.
Finally, the continent attribute as a TAG type. Read more about TAG fields here.

dragonfly$> FT.CREATE cities PREFIX 1 city: SCHEMA name TEXT population NUMERIC SORTABLE continent TAG
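To make the structure of the FT.CREATE command explicit, here is a minimal Python sketch that assembles its arguments from a schema description. The helper name build_ft_create_args is hypothetical and not part of Dragonfly or any client library; it only mirrors the command layout used in this post (index name, PREFIX with a count, then SCHEMA followed by each attribute).

```python
def build_ft_create_args(index_name, prefixes, schema):
    """Assemble the FT.CREATE argument list for an index over prefixed keys.

    `schema` is a list of tuples: (field_name, field_type, *options).
    This helper is illustrative only; it is not a library function.
    """
    args = ["FT.CREATE", index_name, "PREFIX", str(len(prefixes)), *prefixes, "SCHEMA"]
    for name, field_type, *options in schema:
        args += [name, field_type, *options]
    return args


args = build_ft_create_args(
    "cities",
    ["city:"],
    [
        ("name", "TEXT"),
        ("population", "NUMERIC", "SORTABLE"),
        ("continent", "TAG"),
    ],
)
print(" ".join(args))
# With a connected redis-py client, the same list could be sent as:
#   client.execute_command(*args)
```

Keeping the schema as data makes it easy to see that PREFIX is followed by the number of prefixes, and that options such as SORTABLE attach to the attribute that precedes them.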

After creating the index, the FT.INFO command can be used to inspect its details. As shown below, the index conforms to the schema we defined, and it contains the hash documents we created earlier:

dragonfly$> FT.INFO cities
1) index_name
2) cities
3) fields
4) 1) 1) identifier
      2) name
      3) attribute
      4) name
      5) type
      6) TEXT
   # schema for 'population' and 'continent' omitted for brevity...
5) num_docs
6) (integer) 4

Moving on to querying!

Our first example query focuses on cities in Europe. We'll sort them by population in descending order and return only the top document, without skipping any (LIMIT 0 1). The query is also constructed to return only two fields for each result: name and population.

The response contains the total number of documents matched, regardless of the LIMIT option, and the documents themselves. In this case, only London will be returned, displaying first its key and then the selected fields.

dragonfly$> FT.SEARCH cities '@continent:{Europe}' SORTBY population DESC LIMIT 0 1 RETURN 2 name population
1) (integer) 2 # total number of documents matched
2) "city:1"    # document key (i.e. the key to the HASH document)
3) 1) "name"   # selected fields and their values
   2) "London"
   3) "population"
   4) "8.8"

Our second example query aims to display all cities with a population under 5 million that are situated in Asia as shown below:

dragonfly$> FT.SEARCH cities '@population:[0 5] @continent:{Asia}' RETURN 1 name
1) (integer) 1
2) "city:3"
3) 1) "name"
   2) "Tel-Aviv"

For detailed information on the query syntax, refer to our documentation.
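To make the semantics of the two queries above concrete, the following plain-Python sketch models them over an in-memory dictionary. It is only a model of the behavior (filter, sort, then limit), not how Dragonfly executes queries; the search function and its parameters are hypothetical names chosen for illustration.

```python
cities = {
    "city:1": {"name": "London", "population": 8.8, "continent": "Europe"},
    "city:2": {"name": "Athens", "population": 3.1, "continent": "Europe"},
    "city:3": {"name": "Tel-Aviv", "population": 1.3, "continent": "Asia"},
    "city:4": {"name": "Hyderabad", "population": 9.8, "continent": "Asia"},
}


def search(docs, predicate, sort_by=None, descending=False, offset=0, limit=None):
    """Return (total matches, requested page), mimicking filter + SORTBY + LIMIT."""
    matched = [(key, doc) for key, doc in docs.items() if predicate(doc)]
    if sort_by is not None:
        matched.sort(key=lambda item: item[1][sort_by], reverse=descending)
    end = None if limit is None else offset + limit
    return len(matched), matched[offset:end]


# '@continent:{Europe}' SORTBY population DESC LIMIT 0 1
total, page = search(
    cities,
    lambda d: d["continent"] == "Europe",
    sort_by="population",
    descending=True,
    offset=0,
    limit=1,
)
print(total, [(key, doc["name"]) for key, doc in page])

# '@population:[0 5] @continent:{Asia}'
total_asia, page_asia = search(
    cities,
    lambda d: 0 <= d["population"] <= 5 and d["continent"] == "Asia",
)
print(total_asia, [(key, doc["name"]) for key, doc in page_asia])
```

Note how the total is computed before the page is cut, which is why the first query reports two matches while returning a single document.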

The index is dynamic; it automatically updates as document values are added or removed. In a later section of this blog post, we will look into the storage of JSON values. Contrary to simple hashes, JSON documents can store nested values and arrays, enabling the indexing of more complex data structures.

Vector Search: Finding the Closest Match

After exploring how to create and query indexes in the previous section, we now turn our attention to the use of the VECTOR field type. This section will demonstrate building a simple recommendation engine using OpenAI's embeddings.

Vector fields can be used for vector similarity search where the goal is to find documents with vector fields most similar to a given vector. Vectors are extremely powerful, as they can encode various complex objects like text, images, and music. The underlying models aim for a fundamental principle: the closer the vectors, the greater the similarity between the original objects. These vectors are colloquially called embeddings, as they embed the original objects into a vector space.
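As a small illustration of the "closer vectors, greater similarity" principle, the sketch below compares toy vectors using cosine distance. Both the three-dimensional vectors and the choice of cosine distance are assumptions made for the example; real embeddings have far more dimensions, and this post does not state which distance metric the index uses.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical hand-made "embeddings" for three words.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

# Related concepts should end up closer together than unrelated ones.
print(cosine_distance(cat, kitten) < cosine_distance(cat, car))
```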

In the realm of modern applications, vector databases are crucial for executing vector similarity searches. Our example illustrates building a simple service to recommend blog articles to users based on their interests. To convert the text of our blog posts into vectors, we'll utilize OpenAI's service.

We have already gathered all our blog posts, along with their embeddings, in a CSV file named blog-with-embeddings.csv, which can be found in our dragonfly-examples repository. Let's begin by loading this file using the pandas Python library.

import pandas as pd

posts = pd.read_csv('blog-with-embeddings.csv', delimiter=',', quotechar='"', converters={'embedding': pd.eval})
posts.head()

The table shows that each document contains a few fields:

The title field is the blog post title.
The content field is the blog post content.
The embedding field is the vectorized content.

The following step involves initializing Dragonfly, then connecting to it using the official Python Redis client to create our index. We don't need the raw content to be indexed, as we will index the vectorized content instead.

Note that the VectorField constructor accepts additional parameters, such as the algorithm type and the vector dimensions. FLAT is the selected algorithm type and represents brute-force search. An alternative, HNSW (Hierarchical Navigable Small World), is also available. HNSW returns approximate results: it consumes more memory than FLAT, but it needs less computation per query and therefore searches faster on larger datasets.

The configuration options also define the vector dimensions, in this case, 1536 dimensions.

import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

client = redis.Redis()
client.ft("posts").create_index(
        fields = [TextField("title"), VectorField("embedding", "FLAT", {"DIM": "1536"})],
        definition = IndexDefinition(prefix=["post-"], index_type=IndexType.HASH)
)
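For intuition about what a brute-force (FLAT) search involves, here is a minimal sketch that scans every stored vector and keeps the k closest ones. It is a conceptual model, not Dragonfly's implementation; the function name knn_flat, the two-dimensional vectors, and the use of Euclidean distance are assumptions for illustration, and any other distance function could be substituted.

```python
import heapq
import math


def knn_flat(vectors, query, k):
    """Return the k (distance, key) pairs closest to `query` by scanning all vectors."""
    scored = ((math.dist(vec, query), key) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored)


# Hypothetical tiny dataset; real embeddings here would have 1536 dimensions.
vectors = {
    "post-0": [0.0, 0.0],
    "post-1": [1.0, 0.0],
    "post-2": [0.0, 2.0],
    "post-3": [3.0, 3.0],
}

nearest = knn_flat(vectors, query=[0.9, 0.1], k=2)
print([key for _, key in nearest])
```

Every query touches every vector, so the cost grows linearly with the number of documents. That is the trade-off an approximate index such as HNSW is designed to avoid.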

Our blog posts are represented using the HASH data type in Dragonfly. When using hashes, vectors must be encoded in a binary format. For this purpose, we'll employ the numpy Python library. It's important to note that Dragonfly currently supports only the float32 data type. This means each vector should be encoded using 4 bytes per number.

import numpy as np

for i, post in posts.iterrows():
    embedding_bytes = np.array(post['embedding']).astype(np.float32).tobytes()
    client.hset(f"post-{i}", mapping={**post, 'embedding': embedding_bytes})
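To show what "4 bytes per number" means in practice, the sketch below performs the same float32 encoding with Python's standard struct module instead of numpy. The helper names are hypothetical, and little-endian byte order is an assumption (it matches what numpy produces on common little-endian machines).

```python
import struct


def encode_float32(vector):
    """Pack a sequence of numbers as little-endian float32 values (4 bytes each)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def decode_float32(blob):
    """Unpack a bytes object produced by encode_float32 back into a list of floats."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


blob = encode_float32([0.5, -1.25, 3.0])
print(len(blob))             # 3 numbers * 4 bytes each
print(decode_float32(blob))  # these values are exactly representable in float32

# A 1536-dimensional embedding therefore occupies 1536 * 4 bytes.
print(len(encode_float32([0.0] * 1536)))
```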

We've managed to set everything up with just a few lines of code! The final step involves converting user queries into vectors and then querying Dragonfly with these vectors. Note that in order to perform the following step, an OpenAI API key is required. Learn more about obtaining an API key here.

For vector similarity queries, a special syntax is used:

* => [KNN 3 @embedding $query_vector AS vector_score]

The * part represents the filter expression, which can limit the documents considered for the vector similarity search. Using just * selects all documents.
The number 3 specifies that the three closest vectors will be computed.
@embedding denotes the document field where the vectors are stored.
$query_vector is the parameter name containing the target vector.
AS vector_score indicates the name under which the vector distance will be returned.

import openai
from redis.commands.search.query import Query

# How to get an OpenAI API key: https://platform.openai.com/docs/api-reference/introduction
# NOTE: Do not share your API key with anyone, do not commit it to git, do not hardcode it in your code.
openai.api_key = "{YOUR_OPENAI_API_KEY}"
EMBEDDING_MODEL = "text-embedding-ada-002"

# Convert query text to vector using the OpenAI API.
query = "How to switch from a multi node redis setup to Dragonfly"
query_vec = openai.embeddings.create(input=query, model=EMBEDDING_MODEL).data[0].embedding

# Build a search query for Dragonfly.
query_expr = Query("*=>[KNN 3 @embedding $query_vector AS vector_score]").return_fields("title", "vector_score").paging(0, 30)
params = {"query_vector": np.array(query_vec).astype(dtype=np.float32).tobytes()}

# Execute the query and print results.
docs = client.ft("posts").search(query_expr, params).docs
for i, doc in enumerate(docs):
    print(i+1, doc.vector_score, doc.title)

# === Output ===
# 1 0.562158 Zero Downtime Migration from Redis to Dragonfly using Redis Sentinel
# 2 0.568551 Migrating from a Redis Cluster to Dragonfly on a single node
# 3 0.606661 We're Ready for You Now: Dragonfly In-Memory DB Now Supports Replication for High Availability
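The parts of the KNN query syntax described earlier can also be assembled programmatically. The helper below is a hypothetical sketch, not a library function. Wrapping a non-trivial filter expression in parentheses follows the RediSearch convention, and whether a particular filter is supported by the Beta release of Dragonfly Search is an assumption to verify against the documentation.

```python
def knn_query(k, field, param, score_alias, filter_expr="*"):
    """Build a vector similarity query string.

    filter_expr  limits the candidate documents ("*" selects all of them)
    k            number of closest vectors to compute
    field        document field holding the vectors
    param        name of the query parameter carrying the target vector
    score_alias  name under which the distance is returned
    """
    prefix = filter_expr if filter_expr == "*" else f"({filter_expr})"
    return f"{prefix}=>[KNN {k} @{field} ${param} AS {score_alias}]"


print(knn_query(3, "embedding", "query_vector", "vector_score"))
print(knn_query(3, "embedding", "query_vector", "vector_score", filter_expr="@title:Redis"))
```

The resulting string can be passed to the Query constructor in the same way as the literal query used above.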

As shown above, with a few simple steps, we've managed to build a simple recommendation system using Dragonfly Search and OpenAI's embeddings. Because LangChain applications commonly combine OpenAI models with Vector Similarity Search (VSS), Dragonfly Search can be used with LangChain as well. This compatibility broadens the range of applications Dragonfly Search can support, tapping into the capabilities of Large Language Models (LLMs).

Querying JSON Documents

In this final part, we demonstrate how to build an issue tracker using Dragonfly. We'll be using JavaScript, one of the most commonly used programming languages. To simplify document management, we'll utilize the redis-om-node library, which provides an object-mapping interface for Node.js. Again, as Dragonfly is highly compatible with Redis, we can use the same library to interact with Dragonfly.

Let's take a look at a sample issue object:

let issue = {
  author: 'alice',
  title: 'Production error',
  created: 1701203321,
  tags: ['bug', 'important'],
  comments: [
    {
      author: 'bob',
      text: 'Wow, did this really happen?',
      created: 1701203648,
    },
    {
      author: 'caren',
      text: 'We should fix this immediately!',
      created: 1701203954,
    },
  ],
}

We'll store issue objects like the one above as JSON values within Dragonfly. The advantage of indexing JSON values is that a schema field can map not just to a root-level object field, but to an entire JSONPath. JSONPaths are incredibly useful for selecting values from nested structures and arrays.

Now, let's define our schema using redis-om:

import { createClient } from 'redis'
import { Schema, Repository, EntityId } from 'redis-om'

// Create client and connect to Dragonfly.
const dragonfly = createClient()
await dragonfly.connect()

// Build the schema.
const schema = new Schema(
  'issue',
  {
    author: { type: 'string', path: '$.author' },
    title: { type: 'text', path: '$.title' },
    created: { type: 'number', path: '$.created', sortable: true },

    tags: { type: 'string[]', path: '$.tags[*]' },
    participant: { type: 'string[]', path: '$..author' },

    num_comments: {
      type: 'number',
      path: 'length($.comments)',
      sortable: true,
    },
    last_updated: {
      type: 'number',
      path: 'max($.comments[*].created)',
      sortable: true,
    },
  },
  { dataStructure: 'JSON' }
)

// Build repository using the schema and Dragonfly client.
let issueRepository = new Repository(schema, dragonfly)

// Create index for the repository.
try {
  await issueRepository.createIndex()
} catch (e) {
  console.log(e)
}

// Use the repository to save the 'issue' object we defined earlier into Dragonfly.
await issueRepository.save(issue)

Let's break down the schema definition:

The first few fields, author, title, and created, select values directly from the root-level object using the $.field syntax.
As each issue may include multiple tags, the tags field is used to select an array.
To track all participants in an issue, including those who comment, we use the $..author JSONPath. This path selects the author fields from all objects, including comments.
The num_comments and last_updated fields illustrate the usage of simple aggregation functions within JSONPaths.
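To see which values these paths would pick out of the sample issue, here is a plain-Python sketch that computes them by hand. It does not use a JSONPath library or Dragonfly. The collect_authors helper is a hypothetical stand-in for the recursive $..author path, and last_updated is computed from each comment's created timestamp, since the sample comments carry only a created field.

```python
issue = {
    "author": "alice",
    "title": "Production error",
    "created": 1701203321,
    "tags": ["bug", "important"],
    "comments": [
        {"author": "bob", "text": "Wow, did this really happen?", "created": 1701203648},
        {"author": "caren", "text": "We should fix this immediately!", "created": 1701203954},
    ],
}


def collect_authors(node):
    """Recursively gather every 'author' value, similar to the $..author path."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "author":
                found.append(value)
            found.extend(collect_authors(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_authors(item))
    return found


indexed = {
    "author": issue["author"],                  # $.author
    "tags": list(issue["tags"]),                # $.tags[*]
    "participant": collect_authors(issue),      # $..author
    "num_comments": len(issue["comments"]),     # length($.comments)
    "last_updated": max(c["created"] for c in issue["comments"]),
}
print(indexed)
```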

With the schema in place and a few entries created, we can now leverage the query builder to formulate more intricate queries.

Imagine we want to create a dashboard for Alice's homepage on our issue tracker website. We can achieve this by selecting all issues authored by alice, tagged as important, and sorting them to display the most recently updated ones first.

// Search for issues:
//  - authored by 'alice'
//  - tagged as 'important'
//  - sort results by 'last_updated'
let issues = await issueRepository
  .search()
  .where('author')
  .equals('alice')
  .where('tags')
  .contains('important')
  .sortDescending('last_updated')
  .return.all()

console.log(issues)

As shown above, by storing JSON documents in Dragonfly, building an index schema with JSONPaths, and using the query builder, we can easily leverage Dragonfly Search to build applications that require complex data management.

Conclusion

Dragonfly Search represents a significant leap forward in data management and search capabilities for our in-memory data store. It blends the flexibility of traditional database queries with the advanced features of modern AI technologies. However, Dragonfly Search is currently in Beta. As Dragonfly Search progresses, our vision for its evolution is clear and ambitious. We recognize current limitations as opportunities for growth and innovation:

Faster Updates: Though query performance is robust, we are actively working on speeding up the update process.
GeoSearch: We will support the GEO field type and its related command options.
Command Options: More FT.CREATE and FT.SEARCH options will be supported.
Scoring and Full-Text Search: Implementing scoring mechanisms and full-text search functionalities are key objectives as well.

Even with the existing features, we've already seen how Dragonfly Search simplifies complex tasks, from creating efficient indexes to harnessing the power of vector similarity searches with OpenAI embeddings. Our exploration into using Dragonfly for diverse applications, such as building a recommendation system or an issue tracker, demonstrates its versatility and ease of use. If you want to learn more about Dragonfly Search, please register for our Community Office Hours, where the team will give a technical presentation and take questions.

And as always, we encourage you to get started, dive in, experiment, and discover the full potential of Dragonfly Search in your own projects.

Appendix - Useful Resources

Dragonfly Search Documentation
Dragonfly v1.13 Release Notes
The OpenAI + vector search example is available in the dragonfly-examples repository.