Rate Limiting

Feb 3, 2026

You might have come across messages like the one below while using services on the web.

[Image: a "you are being rate limited" error message]

In this blog we will understand what it is, why engineers at OpenAI (and others) do this to you, and how you can do the same to your users. 😬

What is rate limiting?

Rate limiting restricts how many requests a client can make to a service in a given period of time. The limits are rules like:

  • 10 login attempts per hour per IP
  • 1,000 API calls per day per API key
  • Free: 100 requests/hour, Pro: 10,000 requests/hour

And so on.

Why use it?

The general idea is to protect systems and keep things fair.

Here are a few reasons why most production systems have some form of rate limiting.

  • Protect system stability – stops too many requests from crashing servers or databases.
  • Prevent abuse and attacks – makes it harder to brute-force logins or spam the system.
  • Keep things fair – prevents one user from using too much and slowing down others.
  • Control costs – avoids unexpected cloud or API bills caused by excessive requests.
  • Fail safely – returns a clear error instead of timing out or breaking the system.

And so on.

How can I build one?

There are several algorithms used for rate limiting. Today we will cover a brief explanation, an interactive playground, and implementation details for five prominent algorithms.

By the end of this blog, you will not only know how these algorithms work, but also have an idea of which one to use for your specific use case.

Let's understand them one by one.

Token bucket

This algorithm has a close resemblance to reality. To understand it, we need to be in a bathroom 😅.

So you go to your bathroom, open the tap, slide the bucket in, and wait until there is enough water to fill your mug. You fetch the water and use it; meanwhile, more water keeps flowing into the bucket. You can keep fetching water as long as there is enough to fill the mug.

In this setup:

What happens when you fetch water faster than the tap fills the bucket? Eventually you have to wait for more water to flow in before you can use it.

And what happens when you don't fetch water for a long time while the tap is on? Once the bucket is full, all the excess water overflows.

This is exactly the principle the token bucket algorithm works on. You just replace water with tokens.

You need tokens to access any resource.
For example, each time you hit the login endpoint, you consume one token. As long as there are tokens in the bucket, your request is allowed. But when the bucket is empty, your request is rate limited.

Also, there is a limit to the number of tokens you can accumulate, just like with the bucket and water example.

Playground

Here's a playground to help you better understand how the algorithm works.

Play around: turn the tap on, send some requests, try to get some requests dropped and some tokens to overflow.


Now that you have a good idea of how the algorithm works, let's discuss some implementation details.

Implementation

The implementation is not as literal as the real world, though. We don't keep a timer running that continuously drips tokens into each user's bucket for each resource; that would be highly inefficient.

Instead, we fill the bucket lazily, when the user makes a request.

Using the Redis hash data structure, we just store 2 values for each user / identifier:

  • Number of tokens left
  • Last refill time

And based on our requirements we tune the values for bucket size and fill rate.

To decide whether we should accept or reject a request, we do simple maths each time a request comes in:

We fetch the 2 values for the user:

const bucket = await redis.hmget(redisKey, "tokens", "last_refill")
// here redisKey can look like: limiter:login:127.0.0.1

Calculate the number of tokens that would have been added since the last refill:

const now = Date.now()
const lastRefill = Number(bucket[1]) // parse the stored timestamp (ms)
const delta = Math.max(0, now - lastRefill)
const tokensToAdd = delta * ratePerMs // ratePerMs = refill rate in tokens per millisecond

Add these tokens to the bucket:

// Add the refilled tokens, but don't exceed the bucket capacity
let tokens = Math.min(capacity, Number(bucket[0]) + tokensToAdd)

Now, after updating the bucket stats, if we have enough tokens, the request is accepted; otherwise it is rejected.

You can look at the complete implementation here.

Pros & Cons

Pros:

  • Handles bursts naturally: if the bucket is full, you can consume all tokens instantly. This is great for APIs where occasional spikes are expected (like a user uploading multiple files at once).

  • Simple to understand and implement: the mental model (tokens = credits) is intuitive.

  • Memory efficient: just 2 values per user (token count + last refill time).


Cons:

  • Requires careful tuning: you need to balance bucket size (capacity) and refill rate. Set them wrong, and you either block legitimate users or fail to prevent abuse.

  • Can allow bursts that overwhelm your system: if someone saves up 1000 tokens, they can fire 1000 requests instantly, which might still crash your downstream services even though the rate limiter "allowed" it.

  • Not intuitive for users: explaining "you have 47.3 tokens left" is harder than saying "you can make 50 more requests this hour."

Why Lua?

Imagine you have 1 token left in your bucket. Two users, Alice and Bob, hit your API at the exact same millisecond.

Both requests read that 1 token is left in the bucket, then both do the math (subtract the token) and write 0 tokens back to the bucket.

Result: You processed 2 requests, but the database says you only spent 1 token. In a high-traffic system (like a ticket sale), this allows users to sneak past limits.

The problem here is that our read-calculate-write cycle is not atomic.

We can solve this problem using a Lua script. The idea is to offload the read-calculate-write cycle from the application server code to a Lua script that runs inside Redis.

Redis is single-threaded: when a Lua script runs, it blocks everything else. By doing the reads and writes with redis.call inside Lua, the read-calculate-write cycle is safe even if 1,000 requests hit at the exact same millisecond.

So now, even if Alice and Bob request at the same time, their scripts run one after the other, and the second one reads the updated token count.
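
To make this concrete, here is a minimal sketch of the whole cycle as one atomic Lua script. The key layout and argument order are assumptions for illustration, not the exact script from the linked implementation:

-- KEYS[1] = bucket key, e.g. limiter:login:127.0.0.1
-- ARGV[1] = capacity, ARGV[2] = refill rate (tokens/ms), ARGV[3] = now (ms)
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call("HMGET", KEYS[1], "tokens", "last_refill")
local tokens = tonumber(bucket[1]) or capacity -- new users start with a full bucket
local last_refill = tonumber(bucket[2]) or now

-- Lazily refill: add the tokens that "flowed in" since the last request
local delta = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + delta * rate)

-- Spend a token if we have one
local allowed = 0
if tokens >= 1 then
    allowed = 1
    tokens = tokens - 1
end

redis.call("HSET", KEYS[1], "tokens", tokens, "last_refill", now)
return allowed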

Leaky bucket

For this algorithm, assume that we have a bucket with a hole at the bottom. When you make a request, you add water to the bucket, and the water leaks out of the hole at a constant rate. If adding the request's water would exceed the bucket's capacity, the bucket overflows and the request is dropped.

Playground

Hit the request button, get the bucket to fill, and drop some requests.


Implementation

Just like the Token Bucket, we do not use a background timer to constantly "drip" water out of the bucket every millisecond. That would be inefficient at scale. Instead, we calculate the leak lazily, only when a new request arrives.

We assume that between the last request and now, the bucket has been leaking correctly. We simply calculate how much should have leaked in that time gap and update the state instantly.

Here also we use the Redis hash data structure for each user / identifier to store 2 values:

  • Water level
  • Last updated time

And based on our requirements we tune 2 values: the bucket size and the leak rate.

For each request:

  • We get the water_level and last_updated values:

    local state = redis.call("HMGET", key, "water_level", "last_updated")
  • Calculate how much leaked since last time

    local elapsed = math.max(0, now - last_updated)
    local leaked_amount = elapsed * rate
  • Update water level (cannot be negative)

    water_level = math.max(0, water_level - leaked_amount)

    Now we add the request's water to the updated level; if it stays within the bucket's capacity, the request is accepted, else it is rejected.

You can look at the complete implementation here.
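
Assembled into a single atomic Lua script, the flow might look like this minimal sketch (the key layout and argument order are again assumptions for illustration):

-- KEYS[1] = bucket key; ARGV[1] = capacity, ARGV[2] = leak rate (units/ms), ARGV[3] = now (ms)
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local state = redis.call("HMGET", KEYS[1], "water_level", "last_updated")
local water_level = tonumber(state[1]) or 0
local last_updated = tonumber(state[2]) or now

-- Lazily leak: drain whatever should have dripped out since the last request
local elapsed = math.max(0, now - last_updated)
water_level = math.max(0, water_level - elapsed * rate)

-- Each request adds one unit of water; reject if it would overflow
local allowed = 0
if water_level + 1 <= capacity then
    allowed = 1
    water_level = water_level + 1
end

redis.call("HSET", KEYS[1], "water_level", water_level, "last_updated", now)
return allowed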

Pros & Cons

Pros:

  • Smooth, predictable output: processes requests at a constant rate, preventing sudden traffic spikes from hitting your backend. This is ideal when downstream services can't handle bursts (e.g., third-party APIs with strict rate limits).

  • Fair processing: requests are handled in order (FIFO), which can be important for things like payment processing or job queues.

  • Memory efficient: like Token Bucket, just 2 values per user.


Cons:

  • No burst tolerance: even if the bucket is empty and your system is idle, users still can't send more than the leak rate allows. This feels restrictive compared to Token Bucket.

  • Can cause artificial delays: legitimate users might experience slow response times because the algorithm enforces a constant drip rate, even when the system has capacity.

  • Harder to tune: you need to match the leak rate to your actual system capacity, which requires deeper understanding of your backend throughput.

For the next set of algorithms we need some fresh air! Let's move from buckets to windows. 💨

Fixed Window

For the fixed window algorithm, we define a window duration, say 5s, and the number of requests allowed in each window, say 5.

So we have a rule like: we can make 5 requests in each 5s window.

If requests go past the limit in the current window, they are dropped. A new window starts after the window duration is over.

Playground

Play around. Try to hit the window limits.


Implementation

This is one of the simplest algorithms to implement.

Here we just use a key-value pair to count the number of accepted requests in the current window.

If this is the first request, we create the window and set a TTL for the window duration.

-- 1. Increment the counter
local current_count = redis.call("INCR", key)
-- 2. If this is the first request in this window, set the expiry
if current_count == 1 then
    redis.call("PEXPIRE", key, window_size)
end

Then do the math on the counter: if the number of requests in the window (including this one) is within the limit, the request is allowed; else it is rejected.
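
In Lua, that check might look like this sketch (assuming limit is passed in as an argument; because INCR already counted the current request, we compare with <=):

-- current_count already includes this request, so allow
-- as long as it hasn't pushed us past the limit
local allowed = 0
if current_count <= limit then
    allowed = 1
end
return allowed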

After the window duration, the key is deleted (thanks to the TTL). The next request creates a new key, i.e., a new window.

Complete implementation.

Pros & Cons

Pros:

  • Extremely simple: just increment a counter and check against a limit. No complex math or state management.

  • Very memory efficient: only 1 integer per user per window.

  • Fast: O(1) operations (increment + compare). Can handle millions of requests per second.

  • Easy to explain to users: "You can make 100 requests per hour" is straightforward.


Cons:

  • The boundary problem: users can send 2x the intended limit by clustering requests at window edges (e.g., 50 requests at 11:59:59 and 50 more at 12:00:01 = 100 requests in 2 seconds, even though the limit is "50 per minute").

  • Unfair to users who spread requests: someone making steady requests throughout the hour might get blocked, while someone who waits until the last second can burst through.

  • Resets can feel arbitrary: if you hit your limit at 10:05, you have to wait until 11:00 for a reset (even though you've been blocked for 55 minutes).

We just saw that the Fixed Window algorithm is simple but has a "double burst" flaw (allowing 2x traffic at the window boundary). To solve this with perfect precision, we use the Sliding Window Log algorithm.

Sliding window log

Instead of just counting "5 requests", we store the timestamps of all the accepted requests in a log.

When a new request comes in, we look back exactly one window width (e.g., 60 seconds) and count how many logs fall in that range.

If the number of requests in the log is less than the capacity, the request is accepted, else it is rejected.

Playground

Here's the playground. Hit some requests and try to understand why replicating the "double burst" issue here is not possible.



Implementation

The implementation relies entirely on the Redis Sorted Set (ZSET).

A Sorted Set is like a standard Set (unique items), but every item has a Score attached to it. Redis sorts the items based on this score.

Here we store 2 pieces of information:

  • The Member (Value): The unique ID of the request.
  • The Score: The timestamp (in milliseconds).

By using the Timestamp as the Score, we can ask Redis questions like "Remove everything with a score less than X".

We need Lua to perform the Clean -> Count -> Add sequence atomically.

The flow for each request:

  • Calculate the start of the window

    local clear_before = now - window_size
  • Remove all elements older than the window start

    -- This is the "Sliding" part: dropping the tail of the window
    redis.call("ZREMRANGEBYSCORE", key, "-inf", clear_before)
  • Count how many requests are currently in the window (only the accepted requests are in the set)

    local current_count = redis.call("ZCARD", key)
  • Check if allowed (compare the current_count with the limit):

    local allowed = 0
    if current_count < limit then
        allowed = 1
        -- Add the current request to the set
        -- Score = now, Member = unique_id
        redis.call("ZADD", key, now, unique_id)
        -- Set TTL to clean up if the user stops sending requests
        redis.call("PEXPIRE", key, window_size + 1000)
        current_count = current_count + 1
    end
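  • Finally, return the decision to the application (the exact return shape here is an assumption; return whatever your caller needs)

    -- 1 = allowed; also report how many requests are now in the window
    return { allowed, current_count }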

Complete implementation.

Pros & Cons

Pros:

  • Perfectly accurate: no approximations or edge cases. If you say "100 requests per minute," users get exactly 100 requests in any rolling 60-second window.

  • No boundary exploitation: unlike Fixed Window, there's no way to game the system by clustering requests at window edges.

  • Fair distribution: every user's limit is continuously enforced based on actual recent activity, not arbitrary reset times.


Cons:

  • Memory intensive: for high limits (e.g., 5000 req/min), you're storing 5000 timestamps per user. With 10k concurrent users, that's 50 million entries in Redis.

  • Computationally expensive: every request requires cleaning old entries (ZREMRANGEBYSCORE) and counting (ZCARD), which involves O(log N) operations (where N = number of requests). This is heavier than simple counters.

  • Vulnerable to attacks: malicious actors can exhaust your Redis memory by creating millions of rate limit keys, each storing thousands of timestamps.

Sliding window counter

We've seen that the Sliding Window Log is precise but memory-heavy (O(N)), while the Fixed Window is efficient (O(1)) but inaccurate near window boundaries.

To get the best of both worlds, we use the Sliding Window Counter (often called the Hybrid or Weighted approach).

Instead of storing a timestamp for every request, we simply count requests in fixed time buckets (just like Fixed Window).

However, to solve the "boundary" issue, we look at two windows: the Current Window and the Previous Window.

We estimate the current load using a weighted formula:

Weighted Count = Current Count + Previous Count × (Time Remaining in Previous Window / Window Size)
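
To see the formula in action, take some made-up numbers: the window is 60s with a limit of 100, the previous window had 60 accepted requests, and we are 15s into the current window with 20 requests so far. The previous window still overlaps the sliding window for 45 of its 60 seconds, so Weighted Count = 20 + 60 × (45/60) = 20 + 45 = 65. That is under the limit, and the request is allowed.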

Since we just store the counter and not the timestamps of the requests, we assume that all the requests in the previous window were evenly distributed.

This way we get a pretty good estimate while storing just one counter per window. This assumption can lead to inaccurate results when the requests in the previous window are skewed toward one end.

But it averages out and works most of the time.

Playground

Make some requests, see the weighted count update in real time, and try to make sense of it.

Note: we only consider the accepted requests from the previous window.

Implementation

Here we just store 2 keys per user: one counter for the current window and one for the previous window.

For each request, we first fetch both counters (a missing key counts as 0), then:

  • Calculate how far into the current window we are (in ms):

    local elapsed_in_window = now % window_size
  • Calculate the weighted count

    local overlap_factor = (window_size - elapsed_in_window) / window_size
    local weighted_count = current_count + (previous_count * overlap_factor)
  • Check whether the request is allowed and update the values:

    local allowed = 0
    if weighted_count < limit then
        allowed = 1
        -- Increment the current window counter
        redis.call("INCR", current_key)
        -- Set Expiry
        -- We need the data to last for at least 2 windows (current + when it becomes previous)
        -- Adding a small buffer (+1000ms) to be safe.
        redis.call("PEXPIRE", current_key, (window_size * 2) + 1000)
        -- Update the return estimate to reflect this new request
        weighted_count = weighted_count + 1
    end

    Complete implementation.
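
Putting the steps together, here is a minimal sketch of the whole script. The key naming scheme and argument order are my own assumptions; the linked implementation may differ:

-- KEYS[1] = current window key, KEYS[2] = previous window key
--   (e.g. limiter:<user>:<window_id>, where window_id = floor(now / window_size))
-- ARGV[1] = limit, ARGV[2] = window size (ms), ARGV[3] = now (ms)
local limit = tonumber(ARGV[1])
local window_size = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Missing keys simply count as 0
local current_count = tonumber(redis.call("GET", KEYS[1])) or 0
local previous_count = tonumber(redis.call("GET", KEYS[2])) or 0

-- Weight the previous window by how much of it still overlaps the sliding window
local elapsed_in_window = now % window_size
local overlap_factor = (window_size - elapsed_in_window) / window_size
local weighted_count = current_count + (previous_count * overlap_factor)

local allowed = 0
if weighted_count < limit then
    allowed = 1
    redis.call("INCR", KEYS[1])
    -- Keep the counter around long enough to serve as "previous" in the next window
    redis.call("PEXPIRE", KEYS[1], (window_size * 2) + 1000)
end
return allowed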

Pros & Cons

Pros:

  • Best of both worlds: combines Fixed Window's efficiency (O(1) operations, minimal memory) with nearly the accuracy of Sliding Window Log (99.997% correct according to Cloudflare).

  • Industry proven: battle-tested by companies like Cloudflare, GitHub, and Stripe handling billions of requests daily.

  • Memory efficient: just 2 counters per user (current + previous window), regardless of traffic volume.

  • No boundary exploitation: the weighted formula smooths out traffic at window edges, preventing the "double burst" problem.

  • Scales beautifully: whether your limit is 10 or 10 million requests, the cost remains constant.


Cons:

  • Slight inaccuracy: assumes requests in the previous window were evenly distributed. If someone sent all 1000 requests in the last second of the previous window, the algorithm might over-allow in the current window.

  • Edge case errors: according to Cloudflare's experiments, 0.003% of requests (roughly 3 in 100,000) are incorrectly allowed or blocked. This is negligible for most use cases but could matter in high-stakes scenarios like financial transactions.

  • Less intuitive to debug: when users ask "why was I rate limited?", explaining weighted counts is harder than pointing to a simple counter.

Choosing the Right Algorithm

Now that we've explored all five algorithms, let's break down when to use each one:

Token Bucket

Use when: You need to handle natural traffic bursts (file uploads, batch operations, user login after idle periods).

Best for: APIs with bursty workloads where occasional spikes are expected and safe.

Example: A photo upload API where users might upload 10 images at once after a vacation.

Avoid when: Bursts could overwhelm your backend (e.g., database writes, expensive computations).


Leaky Bucket

Use when: You need strict, predictable output rates to protect fragile downstream services.

Best for: Proxying third-party APIs with strict rate limits, job queues, or systems sensitive to traffic spikes.

Example: A service calling a payment gateway that explicitly requires "no more than 10 requests per second."

Avoid when: User experience matters more than backend protection; the forced delay can feel sluggish.


Fixed Window

Use when: Simplicity and speed are paramount, and slight inaccuracies are acceptable.

Best for: Internal services, low-stakes APIs, or scenarios where you just need "good enough" protection quickly.

Example: A dashboard limiting internal admin actions to 1000 per hour.

Avoid when: Users might exploit boundary issues, or you need fairness guarantees.


Sliding Window Log

Use when: You need absolute precision and have the resources to handle the memory/compute cost.

Best for: Financial APIs, security-critical systems, compliance scenarios, or low-limit high-value operations.

Example: An OTP verification endpoint allowing exactly 3 attempts per phone number per 5 minutes.

Avoid when: Handling high-throughput traffic (1000+ req/sec per user) or facing potential DDoS attacks.


Sliding Window Counter ⭐

Use when: You need production-grade rate limiting for high-traffic systems (the default choice for most engineers).

Best for: Public APIs, SaaS platforms, user-facing services, or anywhere you need to balance accuracy, performance, and scalability.

Example: A REST API serving thousands of users with limits like "1000 requests per hour."

Avoid when: You need absolute precision (use Sliding Window Log) or are building something quick-and-dirty (use Fixed Window).


Final Thoughts

If you're still unsure, start with Sliding Window Counter. It's what the industry has converged on for good reason: it handles 99% of use cases with minimal trade-offs.

That said, no algorithm is perfect. Here is a quick decision tree I follow:

Do you need perfect accuracy?
├─ YES → Sliding Window Log (but watch your memory!)
└─ NO → Continue...

Is your backend fragile or rate-limited itself?
├─ YES → Leaky Bucket (smooth output rate)
└─ NO → Continue...

Do you expect natural traffic bursts?
├─ YES → Token Bucket (allows accumulated credits)
└─ NO → Continue...

Is this a quick prototype or internal tool?
├─ YES → Fixed Window (simplest implementation)
└─ NO → Sliding Window Counter (production default)

That's all for this blog. Hope you enjoyed it! 👋