Design A Rate Limiter

In a network system, a rate limiter is used to control the rate of traffic sent by a client or a service. In the HTTP world, a rate limiter limits the number of client requests allowed to be sent over a specified period. If the API request count exceeds the threshold defined by the rate limiter, all the excess calls are blocked. Here are a few examples:

A user can write no more than 2 posts per second.
You can create a maximum of 10 accounts per day from the same IP address.
You can claim rewards no more than 5 times per week from the same device.

In this chapter, you are asked to design a rate limiter. Before starting the design, we first look at the benefits of using an API rate limiter:

Prevent resource starvation caused by Denial of Service (DoS) attack [1]. Almost all APIs published by large tech companies enforce some form of rate limiting. For example, Twitter limits the number of tweets to 300 per 3 hours [2]. Google docs APIs have the following default limit: 300 per user per 60 seconds for read requests [3]. A rate limiter prevents DoS attacks, either intentional or unintentional, by blocking the excess calls.
Reduce cost. Limiting excess requests means fewer servers and allocating more resources to high priority APIs. Rate limiting is extremely important for companies that use paid third party APIs. For example, you are charged on a per-call basis for the following external APIs: check credit, make a payment, retrieve health records, etc. Limiting the number of calls is essential to reduce costs.
Prevent servers from being overloaded. To reduce server load, a rate limiter is used to filter out excess requests caused by bots or users’ misbehavior.

Step 1 - Understand the problem and establish design scope

Rate limiting can be implemented using different algorithms, each with its pros and cons. The interactions between an interviewer and a candidate help to clarify the type of rate limiters we are trying to build.

Candidate: What kind of rate limiter are we going to design? Is it a client-side rate limiter or server-side API rate limiter?
Interviewer: Great question. We focus on the server-side API rate limiter.

Candidate: Does the rate limiter throttle API requests based on IP, the user ID, or other properties?
Interviewer: The rate limiter should be flexible enough to support different sets of throttle rules.

Candidate: What is the scale of the system? Is it built for a startup or a big company with a large user base?
Interviewer: The system must be able to handle a large number of requests.

Candidate: Will the system work in a distributed environment?
Interviewer: Yes.

Candidate: Is the rate limiter a separate service or should it be implemented in application code?
Interviewer: It is a design decision up to you.

Candidate: Do we need to inform users who are throttled?
Interviewer: Yes.

Requirements

Here is a summary of the requirements for the system:

Accurately limit excessive requests.
Low latency. The rate limiter should not slow down HTTP response time.
Use as little memory as possible.
Distributed rate limiting. The rate limiter can be shared across multiple servers or processes.
Exception handling. Show clear exceptions to users when their requests are throttled.
High fault tolerance. If there are any problems with the rate limiter (for example, a cache server goes offline), it does not affect the entire system.

Step 2 - Propose high-level design and get buy-in

Let us keep things simple and use a basic client and server model for communication.

Where to put the rate limiter?

Intuitively, you can implement a rate limiter at either the client or server-side.

Client-side implementation. Generally speaking, client is an unreliable place to enforce rate limiting because client requests can easily be forged by malicious actors. Moreover, we might not have control over the client implementation.
Server-side implementation. Figure 1 shows a rate limiter that is placed on the server-side.

Image represents a simplified client-server architecture with a rate limiter. A light-blue rectangular box labeled 'Client' contains icons representing a laptop and a smartphone, signifying that requests originate from various client devices. A thick, blue arrow labeled 'HTTP request' extends from the Client box to a group of three vertically stacked, light-green rectangular boxes representing 'API Servers,' which are enclosed within a dashed light-blue rounded rectangle. Adjacent to the API Servers, a white rectangular box labeled 'Rate limiter' is shown, suggesting that the rate limiter sits between the client requests and the API servers. The arrow indicates that the HTTP request flows from the client, through the rate limiter, and finally reaches the API servers. The arrangement visually depicts a system where the rate limiter manages the flow of HTTP requests from clients to prevent server overload. — Figure 1

Besides the client and server-side implementations, there is an alternative way. Instead of putting a rate limiter at the API servers, we create a rate limiter middleware, which throttles requests to your APIs as shown in Figure 2.

Image represents a simplified client-server architecture with a rate limiter. A light-blue rounded rectangle labeled 'Client' contains icons representing a laptop and a smartphone, symbolizing various client devices. A solid blue arrow originates from the Client box, indicating requests flowing towards a vertical rectangular component labeled 'Rate limiter'. This rate limiter acts as a gatekeeper, processing requests before they reach the servers. Another solid blue arrow extends from the rate limiter to a dashed light-blue rounded rectangle labeled 'API Servers'. Inside this rectangle are three vertically stacked green rectangles representing multiple API servers. The arrows indicate the unidirectional flow of requests from clients, through the rate limiter, and finally to the API servers for processing. The overall diagram illustrates a common system design pattern where a rate limiter is used to manage and control the number of requests reaching the backend servers, preventing overload and ensuring system stability. — Figure 2

Let us use an example in Figure 3 to illustrate how rate limiting works in this design. Assume our API allows 2 requests per second, and a client sends 3 requests to the server within a second. The first two requests are routed to API servers. However, the rate limiter middleware throttles the third request and returns a HTTP status code 429. The HTTP 429 response status code indicates a user has sent too many requests.

Image represents a system architecture diagram illustrating rate limiting in an API. A light-blue rounded rectangle labeled 'Client' contains icons representing a laptop and a mobile phone, signifying multiple client devices. Multiple solid blue arrows emanate from the Client, representing requests, and point towards a tall, dark-grey rectangle labeled 'Rate limiter.' Some requests successfully pass through the Rate limiter, indicated by the continuation of the arrows to a dashed light-blue rounded rectangle labeled 'API Servers,' which depicts three vertically stacked green server icons. However, one request is blocked by the Rate limiter, shown by a red 'X' and a dashed red arrow pointing back to the Client with the label '429: Too many requests,' indicating a HTTP 429 error response due to exceeding the rate limit. The diagram clearly shows the flow of requests from clients, their filtering by the Rate limiter, and the potential for rejection based on exceeding a defined request rate. — Figure 3

Cloud microservices [4] have become widely popular and rate limiting is usually implemented within a component called API gateway. API gateway is a fully managed service that supports rate limiting, SSL termination, authentication, IP whitelisting, servicing static content, etc. For now, we only need to know that the API gateway is a middleware that supports rate limiting.

While designing a rate limiter, an important question to ask ourselves is: where should the rate limiter be implemented, on the server-side or in a gateway? There is no absolute answer. It depends on your company’s current technology stack, engineering resources, priorities, goals, etc. Here are a few general guidelines:

Evaluate your current technology stack, such as programming language, cache service, etc. Make sure your current programming language is efficient to implement rate limiting on the server-side.
Identify the rate limiting algorithm that fits your business needs. When you implement everything on the server-side, you have full control of the algorithm. However, your choice might be limited if you use a third-party gateway.
If you have already used microservice architecture and included an API gateway in the design to perform authentication, IP whitelisting, etc., you may add a rate limiter to the API gateway.
Building your own rate limiting service takes time. If you do not have enough engineering resources to implement a rate limiter, a commercial API gateway is a better option.

Algorithms for rate limiting

Rate limiting can be implemented using different algorithms, and each of them has distinct pros and cons. Even though this chapter does not focus on algorithms, understanding them at high-level helps to choose the right algorithm or combination of algorithms to fit our use cases. Here is a list of popular algorithms:

Token bucket
Leaking bucket
Fixed window counter
Sliding window log
Sliding window counter

Token bucket algorithm

The token bucket algorithm is widely used for rate limiting. It is simple, well understood and commonly used by internet companies. Both Amazon [5] and Stripe [6] use this algorithm to throttle their API requests.

The token bucket algorithm work as follows:

A token bucket is a container that has pre-defined capacity. Tokens are put in the bucket at preset rates periodically. Once the bucket is full, no more tokens are added. As shown in Figure 4, the token bucket capacity is 4. The refiller puts 2 tokens into the bucket every second. Once the bucket is full, extra tokens will overflow.

Image represents a simplified system illustrating a resource management concept, possibly related to buffer management or rate limiting in system design. A vertical, dark-bordered rectangle depicts a source, containing three stacked, gold coin-like icons representing units of a resource. A downward-pointing arrow labeled 'refill...' connects this source to a dark-green bucket, symbolizing a storage or buffer. Inside the bucket are four stacked coins, indicating the existing resource level. A horizontal arrow labeled 'overfill...' extends from the bucket to a single coin outside the bucket, suggesting an overflow mechanism where excess resources beyond the bucket's capacity are expelled. The overall arrangement shows a resource being added to a limited-capacity container ('refill'), with excess resources being discarded ('overfill') when the container's capacity is exceeded. The text 'Viewer does not support full SVG 1.1' at the bottom is a browser-related message unrelated to the diagram's core meaning. — Figure 4

Each request consumes one token. When a request arrives, we check if there are enough tokens in the bucket. Figure 5 explains how it works.
If there are enough tokens, we take one token out for each request, and the request goes through.
If there are not enough tokens, the request is dropped.

Image represents a system for managing requests and tokens. A block labeled 'requests' sends requests to a decision point represented by a diamond labeled 'enough tokens?'. This decision point checks the level of tokens in a green bucket, which represents a token pool. Tokens are added to the bucket via a downward-pointing arrow labeled 'refill...', showing a stack of gold coins. If there are 'enough tokens?', a line labeled 'yes,...' leads to a block representing successful processing, showing a stack of gold coins and light-blue rectangles, likely representing processed requests. If there are not 'enough tokens?', a line labeled 'no,...' leads to a block indicating an error: 'Viewer does not support full SVG 1.1', suggesting a limitation in handling the request due to insufficient tokens. The overall flow depicts a token-based request processing system where requests are processed only if sufficient tokens are available in the pool, otherwise resulting in an error message. — Figure 5

Figure 6 illustrates how token consumption, refill, and rate limiting logic work. In this example, the token bucket size is 4, and the refill rate is 4 per 1 minute.

Image represents a diagram illustrating a token-based rate limiting system across four different time points. The diagram shows four numbered rectangular boxes, each representing a system's state at a specific time. Box 1 (1:00:00) shows an incoming 'request' arrow pointing to a box containing four coin icons, representing four available tokens. An outgoing arrow then points to Box 2. Box 2 (1:00:05) depicts three incoming 'request' arrows consuming three of the initial tokens (three coin icons remain). Three outgoing arrows then branch out. Box 3 (1:00:20) shows an incoming 'request' arrow pointing to an empty box (zero tokens), indicating a request that will be delayed or rejected due to token depletion. Finally, Box 4 (1:01:00) shows four tokens replenished, suggesting a refill mechanism after a certain time interval. The text below each box provides additional context, specifying the number of tokens available at the start and the outcome of the requests. The arrows represent the flow of requests, and the coins visually represent the available tokens within the rate limiting system. — Figure 6

The token bucket algorithm takes two parameters:

Bucket size: the maximum number of tokens allowed in the bucket
Refill rate: number of tokens put into the bucket every second

How many buckets do we need? This varies, and it depends on the rate-limiting rules. Here are a few examples.

It is usually necessary to have different buckets for different API endpoints. For instance, if a user is allowed to make 1 post per second, add 150 friends per day, and like 5 posts per second, 3 buckets are required for each user.
If we need to throttle requests based on IP addresses, each IP address requires a bucket.
If the system allows a maximum of 10,000 requests per second, it makes sense to have a global bucket shared by all requests.

Pros:

The algorithm is easy to implement.
Memory efficient.
Token bucket allows a burst of traffic for short periods. A request can go through as long as there are tokens left.

Cons:

Two parameters in the algorithm are bucket size and token refill rate. However, it might be challenging to tune them properly.

Leaking bucket algorithm

The leaking bucket algorithm is similar to the token bucket except that requests are processed at a fixed rate. It is usually implemented with a first-in-first-out (FIFO) queue. The algorithm works as follows:

When a request arrives, the system checks if the queue is full. If it is not full, the request is added to the queue.
Otherwise, the request is dropped.
Requests are pulled from the queue and processed at regular intervals.

Figure 7 explains how the algorithm works.

Image represents a system for processing requests using a rate-limiting mechanism. A group of incoming requests, depicted as several light-blue rectangles within a dashed box labeled 'requests,' are fed into a decision diamond labeled 'bucket full?'. If the bucket is not full ('no' branch), the requests enter a queue represented as a rectangle divided into sections, some filled light-blue and some white, labeled 'queue.' This queue suggests a buffer holding requests before processing. From the queue, requests are 'processed at a fixed rate,' indicated by an arrow leading to another group of light-blue rectangles within a dashed box labeled 'requests go through,' signifying processed requests. If the bucket is full ('yes' branch), the requests are directed to a grayed-out area at the bottom, labeled with the text 'Viewer does not support full SVG 1.1,' suggesting a rejection or alternative handling path due to a limitation. The overall diagram illustrates a system that manages incoming requests, buffering them in a queue, and processing them at a controlled rate to prevent overload. — Figure 7

Leaking bucket algorithm takes the following two parameters:

Bucket size: it is equal to the queue size. The queue holds the requests to be processed at a fixed rate.
Outflow rate: it defines how many requests can be processed at a fixed rate, usually in seconds.

Shopify, an ecommerce company, uses leaky buckets for rate-limiting [7].

Pros:

Memory efficient given the limited queue size.
Requests are processed at a fixed rate therefore it is suitable for use cases that a stable outflow rate is needed.

Cons:

A burst of traffic fills up the queue with old requests, and if they are not processed in time, recent requests will be rate limited.
There are two parameters in the algorithm. It might not be easy to tune them properly.

Fixed window counter algorithm

Fixed window counter algorithm works as follows:

The algorithm divides the timeline into fix-sized time windows and assign a counter for each window.
Each request increments the counter by one.
Once the counter reaches the pre-defined threshold, new requests are dropped until a new time window starts.

Let us use a concrete example to see how it works. In Figure 8, the time unit is 1 second and the system allows a maximum of 3 requests per second. In each second window, if more than 3 requests are received, extra requests are dropped as shown in Figure 8.

Image represents a time-series chart visualizing the number of successful and rate-limited requests over a five-minute interval. The horizontal axis represents time, marked at one-minute intervals from 1:00:00 to 1:00:04. The vertical axis represents the number of requests. Each rectangle represents a single request; light gray rectangles denote rate-limited requests, while white rectangles represent successful requests. The chart shows a fluctuating pattern: at 1:00:00, there are three successful requests; at 1:00:01, there are three successful requests and three rate-limited requests; at 1:00:02, there are three successful requests and one rate-limited request; at 1:00:03, there is one successful request; and at 1:00:04, there are three successful requests and two rate-limited requests. The legend clearly distinguishes between successful and rate-limited requests. The bottom right corner displays a message indicating that the viewer does not support full SVG 1.1. — Figure 8

A major problem with this algorithm is that a burst of traffic at the edges of time windows could cause more requests than allowed quota to go through. Consider the following case:

Image represents a time-series chart illustrating request distribution over a period of two minutes. A horizontal timeline runs from 2:00:00 to 2:02:00, marked with vertical lines at each minute. Within a dashed-line box spanning from approximately 2:00:30 to 2:01:30, light-blue rectangles represent individual requests. The rectangles are stacked vertically to show multiple requests occurring at roughly the same time. A total of ten requests are shown, indicated by a double-headed arrow above the box labeled '10 requests'. The requests are not uniformly distributed; some time intervals have more requests than others. Below the chart, the text 'Viewer does not support full SVG 1.1' suggests a possible limitation or constraint related to the data visualization. — Figure 9

In Figure 9, the system allows a maximum of 5 requests per minute, and the available quota resets at the human-friendly round minute. As seen, there are five requests between 2:00:00 and 2:01:00 and five more requests between 2:01:00 and 2:02:00. For the one-minute window between 2:00:30 and 2:01:30, 10 requests go through. That is twice as many as allowed requests.

Pros:

Memory efficient.
Easy to understand.
Resetting available quota at the end of a unit time window fits certain use cases.

Cons:

Spike in traffic at the edges of a window could cause more requests than the allowed quota to go through.

Sliding window log algorithm

As discussed previously, the fixed window counter algorithm has a major issue: it allows more requests to go through at the edges of a window. The sliding window log algorithm fixes the issue. It works as follows:

The algorithm keeps track of request timestamps. Timestamp data is usually kept in cache, such as sorted sets of Redis [8].
When a new request comes in, remove all the outdated timestamps. Outdated timestamps are defined as those older than the start of the current time window.
Add timestamp of the new request to the log.
If the log size is the same or lower than the allowed count, a request is accepted. Otherwise, it is rejected.

We explain the algorithm with an example as revealed in Figure 10.

Image represents a flowchart illustrating a rate-limiting system that allows only two requests per minute. The flowchart shows four stages (numbered 1-4). Each stage is a rectangular box representing a processing unit. Arrows indicate the flow of requests, with timestamps indicating the time of each request. Stage 1 receives a request at 1:00:01 and forwards it. Stage 2 receives this request at 1:00:30 (implying a processing delay) and forwards it. Stage 3 receives a request at 1:00:50 and forwards it. Stage 4 receives requests at 1:00:01 and 1:00:30 (highlighted in red to indicate they are within the rate limit), and then receives another request at 1:01:40, which is processed. The timestamps within each stage show the arrival times of requests at that stage. The title 'Allow 2 requests per minute' clarifies the system's constraint. The diagram demonstrates how the system handles requests within and exceeding the rate limit, showing that requests exceeding the limit are still processed but with a delay. — Figure 10

In this example, the rate limiter allows 2 requests per minute. Usually, Linux timestamps are stored in the log. However, human-readable representation of time is used in our example for better readability.

The log is empty when a new request arrives at 1:00:01. Thus, the request is allowed.
A new request arrives at 1:00:30, the timestamp 1:00:30 is inserted into the log. After the insertion, the log size is 2, not larger than the allowed count. Thus, the request is allowed.
A new request arrives at 1:00:50, and the timestamp is inserted into the log. After the insertion, the log size is 3, larger than the allowed size 2. Therefore, this request is rejected even though the timestamp remains in the log.
A new request arrives at 1:01:40. Requests in the range [1:00:40,1:01:40) are within the latest time frame, but requests sent before 1:00:40 are outdated. Two outdated timestamps, 1:00:01 and 1:00:30, are removed from the log. After the remove operation, the log size becomes 2; therefore, the request is accepted.

Pros:

Rate limiting implemented by this algorithm is very accurate. In any rolling window, requests will not exceed the rate limit.

Cons:

The algorithm consumes a lot of memory because even if a request is rejected, its timestamp might still be stored in memory.

Sliding window counter algorithm

The sliding window counter algorithm is a hybrid approach that combines the fixed window counter and sliding window log. The algorithm can be implemented by two different approaches. We will explain one implementation in this section and provide reference for the other implementation at the end of the section. Figure 11 illustrates how this algorithm works.

Image represents a graphical depiction of a rolling rate limiter. The horizontal axis represents time, divided into 'previous minute' and 'current minute' segments. The vertical axis represents the number of requests. A light-green rectangle labeled 'Rolling minute' spans across the boundary of the previous and current minutes, with 70% of its area in the previous minute and 30% in the current minute. This represents a rolling window of one minute. Within the rolling minute rectangle, several light-blue rectangles represent individual requests. A black rectangle encloses the requests within the previous minute, while a light-blue rectangle encloses the requests within the current minute. An arrow points from the top to the 'Rolling minute' rectangle, labeled 'Current time,' indicating the current position within the rolling window. A text annotation states 'Rate limit: 5 requests/min,' indicating the maximum allowed requests per minute. The diagram visually demonstrates how requests are counted within a rolling one-minute window to enforce the rate limit. — Figure 11

Assume the rate limiter allows a maximum of 7 requests per minute, and there are 5 requests in the previous minute and 3 in the current minute. For a new request that arrives at a 30% position in the current minute, the number of requests in the rolling window is calculated using the following formula:

Requests in current window + requests in the previous window * overlap percentage of the rolling window and previous window
Using this formula, we get 3 + 5 * 0.7% = 6.5 request. Depending on the use case, the number can either be rounded up or down. In our example, it is rounded down to 6.

Since the rate limiter allows a maximum of 7 requests per minute, the current request can go through. However, the limit will be reached after receiving one more request.

Due to the space limitation, we will not discuss the other implementation here. Interested readers should refer to the reference material [9]. This algorithm is not perfect. It has pros and cons.

Pros

It smooths out spikes in traffic because the rate is based on the average rate of the previous window.
Memory efficient.

Cons

It only works for not-so-strict look back window. It is an approximation of the actual rate because it assumes requests in the previous window are evenly distributed. However, this problem may not be as bad as it seems. According to experiments done by Cloudflare [10], only 0.003% of requests are wrongly allowed or rate limited among 400 million requests.

High-level architecture

The basic idea of rate limiting algorithms is simple. At the high-level, we need a counter to keep track of how many requests are sent from the same user, IP address, etc. If the counter is larger than the limit, the request is disallowed.

Where shall we store counters? Using the database is not a good idea due to slowness of disk access. In-memory cache is chosen because it is fast and supports time-based expiration strategy. For instance, Redis [11] is a popular option to implement rate limiting. It is an in-memory store that offers two commands: INCR and EXPIRE.

INCR: It increases the stored counter by 1.
EXPIRE: It sets a timeout for the counter. If the timeout expires, the counter is automatically deleted.

Figure 12 shows the high-level architecture for rate limiting, and this works as follows:

Image represents a system architecture diagram illustrating a client-server interaction with rate limiting. A rectangular box labeled 'Client' contains icons representing a laptop and a mobile phone, signifying various client devices. A solid blue arrow connects the Client to a light-blue hexagonal box labeled 'Rate limiter middleware,' indicating that client requests first pass through this rate-limiting component. From the rate limiter, another solid blue arrow points to a dashed-line box labeled 'API Servers,' depicted as three vertically stacked green server icons, representing the backend application servers handling client requests. Finally, a solid blue arrow extends from the rate limiter to a red, three-layered cube labeled 'Redis,' suggesting that the rate limiter uses Redis, an in-memory data store, to track and manage request rates. The arrows indicate the unidirectional flow of requests, with the API Servers potentially sending responses back through the rate limiter to the client (though this return path isn't explicitly shown). — Figure 12

The client sends a request to rate limiting middleware.
Rate limiting middleware fetches the counter from the corresponding bucket in Redis and checks if the limit is reached or not.
If the limit is reached, the request is rejected.
If the limit is not reached, the request is sent to API servers. Meanwhile, the system increments the counter and saves it back to Redis.

Step 3 - Design deep dive

The high-level design in Figure 12 does not answer the following questions:

How are rate limiting rules created? Where are the rules stored?
How to handle requests that are rate limited?

In this section, we will first answer the questions regarding rate limiting rules and then go over the strategies to handle rate-limited requests. Finally, we will discuss rate limiting in distributed environment, a detailed design, performance optimization and monitoring.

Rate limiting rules

Lyft open-sourced their rate-limiting component [12]. We will peek inside of the component and look at some examples of rate limiting rules:

domain: messaging
descriptors:
  - key: message_type
    value: marketing
    rate_limit:
      unit: day
      requests_per_unit: 5

In the above example, the system is configured to allow a maximum of 5 marketing messages per day. Here is another example:

domain: auth
descriptors:
  - key: auth_type
    value: login
    rate_limit:
      unit: minute
      requests_per_unit: 5

This rule shows that clients are not allowed to login more than 5 times in 1 minute. Rules are generally written in configuration files and saved on disk.

Exceeding the rate limit

In case a request is rate limited, APIs return a HTTP response code 429 (too many requests) to the client. Depending on the use cases, we may enqueue the rate-limited requests to be processed later. For example, if some orders are rate limited due to system overload, we may keep those orders to be processed later.

Rate limiter headers

How does a client know whether it is being throttled? And how does a client know the number of allowed remaining requests before being throttled? The answer lies in HTTP response headers. The rate limiter returns the following HTTP headers to clients:

X-Ratelimit-Remaining: The remaining number of allowed requests within the window.

X-Ratelimit-Limit: It indicates how many calls the client can make per time window.

X-Ratelimit-Retry-After: The number of seconds to wait until you can make a request again without being throttled.

When a user has sent too many requests, a 429 too many requests error and X-Ratelimit-Retry-After header are returned to the client.

Detailed design

Figure 13 presents a detailed design of the system.

Image represents a system architecture diagram illustrating API request handling and rate limiting. A client (represented by laptop and phone icons) sends requests to a Rate Limiter middleware. If the request is within the rate limits, the middleware labels it 'success' and forwards it to a CACHE, which contains 'Cached rules'. The CACHE forwards the request to a set of API Servers. If the request is not cached, the API Servers process it and store the result in Redis. If the request exceeds the rate limits, the middleware labels it 'rate limited' and the request is handled in one of two ways: option 1, the request is dropped (represented by a circle), or option 2, the request is sent to a Message queue for later processing. A 'Rules' box sits above the 'Workers' (which feed the CACHE), indicating that the rate limiting rules originate from this source. The system returns a '429: too many requests' error to the client if rate limiting is exceeded. — Figure 13

Rules are stored on the disk. Workers frequently pull rules from the disk and store them in the cache.
When a client sends a request to the server, the request is sent to the rate limiter middleware first.
Rate limiter middleware loads rules from the cache. It fetches counters and last request timestamp from Redis cache. Based on the response, the rate limiter decides:
if the request is not rate limited, it is forwarded to API servers.
if the request is rate limited, the rate limiter returns 429 too many requests error to the client. In the meantime, the request is either dropped or forwarded to the queue.

Rate limiter in a distributed environment

Building a rate limiter that works in a single server environment is not difficult. However, scaling the system to support multiple servers and concurrent threads is a different story. There are two challenges:

Race condition
Synchronization issue

Race condition

As discussed earlier, rate limiter works as follows at the high-level:

Read the counter value from Redis.
Check if (counter + 1) exceeds the threshold.
If not, increment the counter value by 1 in Redis.

Race conditions can happen in a highly concurrent environment as shown in Figure 14.

Image represents a diagram illustrating a race condition in a counter system. The diagram shows two independent requests, labeled 'request 1' and 'request 2,' each initiating a sequence of operations. Both requests first call a 'read_counter' function, which returns the current counter value (initially 3). Next, both requests proceed to a 'check_and_increment' function. The 'check_and_increment' function, in this example, appears to read the counter value, then increment it. Request 1 reads 3, increments to 4, and updates the counter. Request 2 concurrently reads 3, increments to 4, and updates the counter. The final counter value is 4, but the expected value after two increments from an initial value of 3 should be 5. This discrepancy highlights the race condition: the counter's value is not atomically updated, leading to an incorrect final result. The diagram visually depicts the flow of requests and the counter's value at each stage, clearly showing the concurrent operations and the resulting inconsistency. — Figure 14

Assume the counter value in Redis is 3. If two requests concurrently read the counter value before either of them writes the value back, each will increment the counter by one and write it back without checking the other thread. Both requests (threads) believe they have the correct counter value 4. However, the correct counter value should be 5.

Locks are the most obvious solution for solving race condition. However, locks will significantly slow down the system. Two strategies are commonly used to solve the problem: Lua script [13] and sorted sets data structure in Redis [8]. For readers interested in these strategies, refer to the corresponding reference materials [8] [13].

Synchronization issue

Synchronization is another important factor to consider in a distributed environment. To support millions of users, one rate limiter server might not be enough to handle the traffic. When multiple rate limiter servers are used, synchronization is required. For example, on the left side of Figure 15, client 1 sends requests to rate limiter 1, and client 2 sends requests to rate limiter 2. As the web tier is stateless, clients can send requests to a different rate limiter as shown on the right side of Figure 15. If no synchronization happens, rate limiter 1 does not contain any data about client 2. Thus, the rate limiter cannot work properly.

Image represents two different system architectures for handling rate limiting. The left side shows a simpler design where two clients (Client 1 and Client 2), each represented by a laptop and a mobile phone icon within a rounded rectangle labeled with the client number, independently connect to their respective rate limiters. Each rate limiter, depicted as a light-blue hexagon labeled 'Rate limiter 1' and 'Rate limiter 2' respectively, processes requests from its assigned client. A unidirectional arrow indicates the flow of requests from the client to the corresponding rate limiter. The right side illustrates a more complex architecture where the same two clients now connect to both rate limiters. Client 1 sends requests to both 'Rate limiter 1' and 'Rate limiter 2,' and similarly, Client 2 sends requests to both rate limiters. The connections are represented by unidirectional arrows showing the flow of requests from each client to both rate limiters. The dashed lines around each system suggest separate deployments or logical groupings. — Figure 15

One possible solution is to use sticky sessions that allow a client to send traffic to the same rate limiter. This solution is not advisable because it is neither scalable nor flexible. A better approach is to use centralized data stores like Redis. The design is shown in Figure 16.

Image represents a system architecture diagram illustrating rate limiting using Redis. Two clients, labeled 'Client 1' and 'Client 2,' each represented by icons of a laptop and a mobile phone within light-blue rectangular boxes, send requests. Each client's requests pass through a separate rate limiter, depicted as light-blue hexagons labeled 'Rate limiter 1' and 'Rate limiter 2' respectively. Thick blue arrows indicate the flow of requests. Both rate limiters then send information to a central Redis database, represented by a red, three-layered cube with geometric shapes on top, labeled 'Redis.' The arrows show that both rate limiters send data to the same Redis instance. The overall diagram shows a client-rate limiter-database architecture where the rate limiters manage the flow of requests from multiple clients to a shared Redis database, likely for managing request frequency and preventing abuse. — Figure 16

Performance optimization

Performance optimization is a common topic in system design interviews. We will cover two areas to improve.

First, multi-data center setup is crucial for a rate limiter because latency is high for users located far away from the data center. Most cloud service providers build many edge server locations around the world. For example, as of 5/20 2020, Cloudflare has 194 geographically distributed edge servers [14]. Traffic is automatically routed to the closest edge server to reduce latency.

Image represents a simplified map, possibly of a geographical region, showing the distribution of data points or resources. The map is primarily light blue, depicting landmasses with irregular boundaries. Scattered across these landmasses are numerous light purple circles with white outlines, suggesting locations of individual entities or data points. A single, larger light purple circle with a white outline is located on one of the landmasses, and a dark grey circle is situated to its left. A black arrow points from the grey circle directly towards the larger light purple circle, indicating a flow of information or a connection between these two points. The grey circle might represent a source or origin, while the larger purple circle could be a destination or central point receiving data. The smaller purple circles are distributed across both landmasses, suggesting a network or cluster of related entities. The overall arrangement suggests a client-server or data aggregation model, where data from multiple sources (smaller purple circles) might be collected and processed at a central location (larger purple circle), potentially originating from a specific source (grey circle). — Figure 17 (Source: [10])

Second, synchronize data with an eventual consistency model. If you are unclear about the eventual consistency model, refer to the “Consistency” section in the “Design a Key-value Store” chapter.

Monitoring

After the rate limiter is put in place, it is important to gather analytics data to check whether the rate limiter is effective. Primarily, we want to make sure:

The rate limiting algorithm is effective.
The rate limiting rules are effective.

For example, if rate limiting rules are too strict, many valid requests are dropped. In this case, we want to relax the rules a little bit. In another example, we notice our rate limiter becomes ineffective when there is a sudden increase in traffic like flash sales. In this scenario, we may replace the algorithm to support burst traffic. Token bucket is a good fit here.

Step 4 - Wrap up

In this chapter, we discussed different algorithms of rate limiting and their pros/cons. Algorithms discussed include:

Token bucket
Leaking bucket
Fixed window
Sliding window log
Sliding window counter

Then, we discussed the system architecture, rate limiter in a distributed environment, performance optimization and monitoring. Similar to any system design interview questions, there are additional talking points you can mention if time allows:

Hard vs soft rate limiting.
Hard: The number of requests cannot exceed the threshold.
Soft: Requests can exceed the threshold for a short period.
Rate limiting at different levels. In this chapter, we only talked about rate limiting at the application level (HTTP: layer 7). It is possible to apply rate limiting at other layers. For example, you can apply rate limiting by IP addresses using Iptables [15] (IP: layer 3). Note: The Open Systems Interconnection model (OSI model) has 7 layers [16]: Layer 1: Physical layer, Layer 2: Data link layer, Layer 3: Network layer, Layer 4: Transport layer, Layer 5: Session layer, Layer 6: Presentation layer, Layer 7: Application layer.
Avoid being rate limited. Design your client with best practices:
Use client cache to avoid making frequent API calls.
Understand the limit and do not send too many requests in a short time frame.
Include code to catch exceptions or errors so your client can gracefully recover from exceptions.
Add sufficient back off time to retry logic.

Congratulations on getting this far! Now give yourself a pat on the back. Good job!

Reference materials

[1] Rate-limiting strategies and techniques:
https://cloud.google.com/solutions/rate-limiting-strategies-techniques

[2] Twitter rate limits: https://developer.twitter.com/en/docs/basics/rate-limits

[3] Google docs usage limits: https://developers.google.com/docs/api/limits

[4] IBM microservices: https://www.ibm.com/cloud/learn/microservices

[5] Throttle API requests for better throughput:
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html

[6] Stripe rate limiters: https://stripe.com/blog/rate-limiters

[7] Shopify REST Admin API rate limits:
https://help.shopify.com/en/api/reference/rest-admin-api-rate-limits

[8] Better Rate Limiting With Redis Sorted Sets:
https://engineering.classdojo.com/blog/2015/02/06/rolling-rate-limiter/

[9] System Design — Rate limiter and Data modelling:
https://medium.com/@saisandeepmopuri/system-design-rate-limiter-and-data-modelling-9304b0d18250

[10] How we built rate limiting capable of scaling to millions of domains:
https://blog.cloudflare.com/counting-things-a-lot-of-different-things/

[11] Redis website: https://redis.io/

[12] Lyft rate limiting: https://github.com/lyft/ratelimit

[13] Scaling your API with rate limiters:
https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff574d#request-rate-limiter

[14] What is edge computing:
https://www.cloudflare.com/learning/serverless/glossary/what-is-edge-computing/

[15] Rate Limit Requests with Iptables: https://blog.programster.org/rate-limit-requests-with-iptables

[16] OSI model:
https://en.wikipedia.org/wiki/OSI_model#Layer_architecture

← Previous 6. Design a Unique ID Generator Next → 8. Design a URL Shortener