Rate Limiting with Dragonfly

Below I'm going to provide a deep dive in to rate limiting algoriths and how we selected the right algorithm for rate limiting with Dragonly. If you want to jump to the section on using this API click here.

At my previous company, we faced the challenge of handling over 300K incoming requests per second. Although we successfully constructed a stateless auto-scaling system that could handle the volume, there were some backend services that struggled with the load. Consequently, we developed a rate limiter to manage the traffic to these services. Despite several iterations and multiple post-mortem analyses, our solution remained imperfect. In reality, creating a reliable rate limiter for high throughput workloads is a significant challenge.

This problem was not unique to our team. Rate limiters are used in a variet of use cases to regulate the rate of executing actions or accessing resources. Common use cases include:

API requests: Requests can be limited based on a specific number of requests at a given time to prevent overloading of a server or database.
Login attempts: The number of login attempts can be limited to a specific period to prevent brute force attacks.
Form submissions: To prevent spamming, the rate at which a user can submit a form can be limited.
Managing quotas: To limit the number of actions a customer can perform based on their service tier, to enforce usage-based pricing.
Resource allocation: To limit the rate at which different users or processes can access a resource for fair usage.

As developers, we all need to use or build a rate limiter at some point in our careers. There are many algorithms to choose from, let's look at some of most common ones first.

Rate Limiting Algorithms

Fixed Window Algorithm: The fixed window algorithm divides time into fixed intervals or windows and allows a maximum number of requests within each window. When the limit is reached, any further requests within the same window are throttled. This algorithm is simple to implement but can lead to uneven rate limiting, as clients can make many requests at the end of one window and the beginning of the next.
Sliding Window Algorithm: an improved approach to rate limiting that provides a more balanced and fair experience compared to the fixed window algorithm. Instead of dividing time into fixed intervals or windows, the Sliding Window algorithm considers a continuous rolling window of time
Token Bucket Algorithm: In this algorithm, tokens are generated at a fixed rate and are used to represent a request. Requests are allowed as long as tokens are in the bucket.
Leaky Bucket Algorithm: This algorithm is similar to the token bucket algorithm, but requests are stored in a bucket that leaks at a fixed rate instead of tokens. Requests are only allowed if the bucket is not full.

Each algorithm fits a slightly different use case. In the case of Dragonfly, we were looking for an algorithm that would be versatile enough to cover a wide range of use cases but will require low computational and memory overhead. We decided to implement a variation of the leaky bucket algorithm called the "Generic Cell Rate Algorithm" (GCRA). Now, let's examine the specifics of this algorithm.

Generic Cell Rate Algorithm (GCRA)

Although its name can be confusing ("Cell Rate"), GCRA is a form of a leaky bucket algorithm that was used to regulate traffic in a communication technology called Asynchronous Transfer Mode (ATM). In ATM, data was split into fixed-size packets called "cells" (Hence the name). GCRA was used by ATM network's scheduler to control the flow of "cells". If too many cells came in too quickly, GCRA would delay or drop them out.

In GCRA “emission interval” (T) represents the rate at which the bucket is drained. Each request has a "cost" which is a multiplier of the "emission interval" to represent the increment time units to be added to the bucket. GCRA uses a single time variable called “theoretical arrival time” (TAT) to mark the next time a request can arrive without being rejected.

TAT is seeded on the first request to be the arrival time of the request plus a request cost. When a subsequent request arrives, we calculate the earliest arrival time of a request by subtracting burst capacity (τ + T) from TAT. If the earliest arrival time is in the past, we allow the request. Otherwise, the request is rejected. After a successful request, a new TAT is calculated by adding the request cost (fill the bucket).

The description is confusing enough, and a code sample will not make it much more clear, so instead, let's look at a visual abstraction.

Using CL.THROTTLE with Dragonfly

Dragonfly is an in-memory datastore that scales vertically to support millions of operations per second and terabyte-sized workloads on a single instance. It is fully compatible with the Redis ecosystem and requires no code changes to implement. Using Dragonfly's CL.THROTTLE command (shout out to the redis-cell project that origingally added this commannd for Redis) you can run various rate limiter configurations to best suit your use case.

The CL.THROTTLE Command

CL.THROTTLE emailGW 20 120 60 1
               ▲     ▲  ▲  ▲  ▲
               |     |  |  |  └──── apply 1 token
               |     |  └──┴─────── 120 tokens / 60 seconds
               |     └───────────── 20 Capacity
               └─────────────────── key "emailGW"

Let's take a look at each of the parameters in the CL.THROTTLE emailGW 20 120 60 1 command. emailGW is a bucket with a capacity of 20 leaking at the rate of 120 unites per 60 seconds (2 unites per second). A single token is applied against the rate limit of the key emailGW.

Key: The key name (emailGW) is used to identify the rate limiter in Dragonfly. It can be any string you choose and is used to store the state of the rate limiter.

Capacity: Represents the bucket capacity. Or the maximum number of units allowed in a time window. In this case, 20.

Leak rate: This parameter sets the rate at which the bucket leaks, which is the number of units (requests) leaking from the bucket on every time window. In this case, the rate is 120 units per 60 seconds (two units every second).

Time window: This parameter sets the time window for the rate limiter. In this case, the time window is 60 seconds.

Apply Unit: This parameter sets the cost of the request. In this case, a single unit (1) should be applied against the rate limit of the key. If omitted, the default value is 1.

Response

127.0.0.1:6379> CL.THROTTLE emailGW 20 120 60 1
1) (integer) 0   ──── limited?
2) (integer) 21  ──── RateLimit-Limit
3) (integer) 20  ──── RateLimit-Remaining
4) (integer) -1  ──── Retry-After
5) (integer) 0   ──── RateLimit-Reset

Let's look at the meaning of each item in the result array.

Limited: This is a binary value (0 or 1). If the returned value is 1, it indicates that the request is limited (throttled). If it's 0, the request is not limited, and the client can proceed.
RateLimit-Limit: The total limit of the key (max_burst +1).
RateLimit-Remaining: The remaining limit for the key.
Retry-After: If the request is limited (throttled), this value indicates the number of seconds the client should wait before retrying the request. If the request is not limited, this value will be -1.
RateLimit-Reset: This value represents the number of seconds remaining until the token bucket is fully drained, allowing the client to make a burst of requests up to the capacity limit.

Example

I assume you have a Dragonfly instance running. If not, please refer to the getting started guide to get a Dragonfly running on your machine. Also, ensure you are connected to Dragonfly using the redis-cli tool.

Let's start with a simple example where you have an application that accepts new user registrations. Suppose, after each registration, you have to send a welcome email, where the email gateway has a 120 emails per minute limit.

A trivial use of CL.THROTTLE will look like this (with capacity = 0).

127.0.0.1:6379> CL.THROTTLE emailGW 0 120 60
1) (integer) 0
2) (integer) 1
3) (integer) 0
4) (integer) -1
5) (integer) 2

In this case, a request is allowed every 0.5 seconds. Any additional request during the 0.5 seconds will be rejected. This is nice, but in most cases, we are not getting an email precisely every 0.5 seconds. This "Discrete" behavior can be useful in cases where the operation occupies the CPU for 0.5 seconds. But in most cases, we would like a more smooth behavior. For this, we can use the capacity parameter.

Allow burst

The capacity parameter allows us to define the number of units the limiter will accept in the given window without limiting when the requests are coming. In the following example, we allow up to 10 requests to run simultaneously.

127.0.0.1:6379> CL.THROTTLE emailGW 10 120 60
1) (integer) 0
2) (integer) 11
3) (integer) 10
4) (integer) -1
5) (integer) 0

Note: In this configuration, we theoretically allow 130 requests per 60 seconds as the bucket capacity is 10 leaking at a rate of 120 units per minute. To smooth this behavior, you can configure CL.THROTTLE emailGW 10 1200 600. (Now it will allow a max of 121 requests per minute and not 130)

With request cost

Not all requests are equal. Some take more resources than others. For example, an HTTP server can count a simple GET request as a single unit but a more complex "Search" request as two units or more. The "Apply units" parameter allows to assign a cost factor to every request. In the next example, we apply a cost of 2 units for emails with attachments, so 120 emails could be sent per minute but only 60 emails with attachments (or 90 emails and 15 with attachments)

127.0.0.1:6379> CL.THROTTLE emailGW 10 120 60 2
1) (integer) 0
2) (integer) 11
3) (integer) 9
4) (integer) -1
5) (integer) 1

What if we keep the same key and change the other parameters?

Rate limiting parameters are provided with every call to CL.THROTTLE so that limits can easily be reconfigured on the fly. In most cases, you will use the exact same parameters (configuration) for a given key. But in some cases, like with the usage of apply units, requests for the same key with a different configuration makes sense.

Conclusion

Rate limiters are an essential part of software, especially software that deals with heavy traffic, in order to regulate resource access and prevent overloading or spamming. We added CL.THROTTLE to Dragonfly to simplify the work of developers and allow you to create realtime applications easily while protecting your users and backend resources.

The IETF is currently publishing a standard for HTTP rate-limiting headers to be used with RESTful API servers, for example. So maybe in my next post, I will create a real example using the Dragonfly rate limiter for a RESTful API server. Stay tuned!