Rate Limiting

Feb 3, 2026

You might have come across messages like the one below while using services on the web.

[Image: a "you are being rate limited" error message]

In this blog we will understand what it is, why engineers at OpenAI (and others) do this to you, and how you can do the same to your users. 😬

What is rate limiting?

Rate limiting restricts how many requests a client can make to a service in a given period of time. The limits are rules like:

  • 10 login attempts per hour per IP
  • 1,000 API calls per day per API key
  • Free: 100 requests/hour, Pro: 10,000 requests/hour

And so on.

Why use it?

The general idea is to protect systems and keep things fair.

Here are a few reasons why most production systems have some form of rate limiting.

  • Protect system stability – stops too many requests from crashing servers or databases.
  • Prevent abuse and attacks – makes it harder to brute-force logins or spam the system.
  • Keep things fair – prevents one user from using too much and slowing down others.
  • Control costs – avoids unexpected cloud or API bills caused by excessive requests.
  • Fail safely – returns a clear error instead of timing out or breaking the system.

And so on.

How can I build one?

There are several algorithms used for rate limiting. Today we will cover a brief explanation, an interactive playground, and implementation details for five prominent algorithms.

By the end of this blog, you will not only know how these algorithms work, but also have an idea of which one to use for your specific use case.

Let's understand them one by one.

Token bucket

This algorithm has a close resemblance to reality. To understand it, we need to be in a bathroom 😅.

So you go to your bathroom, open the tap, slide the bucket in, and wait until there is enough water to fill your mug. You fetch the water and use it; meanwhile, more water keeps flowing into the bucket. You can keep fetching water as long as there is enough to fill the mug.

In this setup:

What happens when you fetch water faster than the tap fills the bucket? Eventually you have to wait for more water to flow in before you can use it.

And what happens when you don't fetch water for a long time while the tap is on? Once the bucket is full, all the excess water overflows.

This is exactly the principle the token bucket algorithm works on. You just replace water with tokens.

You need tokens to access any resource.
For example, each time you hit the login endpoint, you consume one token. As long as there are tokens in the bucket, your request is allowed. But when the bucket is empty, your request is rate limited.

Also, there is a limit to the number of tokens you can accumulate, just like with the bucket and water example.

Playground

Here's a playground to help you better understand how the algorithm works.

Play around: turn the tap on, send some requests, try to get some requests dropped and some tokens to overflow.


Now that you have a good idea of how the algorithm works, let's discuss some implementation details.

Implementation

The implementation is not as literal as the real world, though. We don't keep a timer running that continuously drips tokens into each user's bucket for each resource; that would be highly inefficient.

Instead, we fill the bucket lazily, when the user makes a request.

Using the Redis hash data structure, we just store 2 values for each user / identifier:

  • Number of tokens left
  • Last refill time

And based on our requirements we tune the values for bucket size and fill rate.

To decide whether we should accept or reject a request, we do simple maths each time a request comes in:

We fetch the 2 values for the user:

const bucket = await redis.hmget(redisKey, "tokens", "last_refill")
// here redisKey can look like: limiter:login:127.0.0.1

Calculate the number of tokens that would have been added since the last refill:

const now = Date.now()
const lastRefill = Number(bucket[1]) // parse the stored timestamp (ms)
const delta = Math.max(0, now - lastRefill)
const tokensToAdd = delta * ratePerMs // ratePerMs = refill rate in tokens per millisecond

Add these tokens to the bucket:

// Add the refilled tokens, but don't exceed the bucket capacity
let tokens = Math.min(capacity, Number(bucket[0]) + tokensToAdd)

Now, after updating the bucket stats, if we have enough tokens, the request is accepted; otherwise it is rejected.

You can look at the complete implementation here.

Pros & Cons

Pros:

  • Handles bursts naturally: if the bucket is full, you can consume all tokens instantly. This is great for APIs where occasional spikes are expected (like a user uploading multiple files at once).

  • Simple to understand and implement: the mental model (tokens = credits) is intuitive.

  • Memory efficient: just 2 values per user (token count + last refill time).


Cons:

  • Requires careful tuning: you need to balance bucket size (capacity) and refill rate. Set them wrong, and you either block legitimate users or fail to prevent abuse.

  • Can allow bursts that overwhelm your system: if someone saves up 1000 tokens, they can fire 1000 requests instantly, which might still crash your downstream services even though the rate limiter "allowed" it.

  • Not intuitive for users: explaining "you have 47.3 tokens left" is harder than saying "you can make 50 more requests this hour."

Why Lua?

Imagine you have 1 token left in your bucket. Two users, Alice and Bob, hit your API at the exact same millisecond.

Both requests read that 1 token is left in the bucket, then both do the math (subtract the token) and write 0 tokens back to the bucket.

Result: You processed 2 requests, but the database says you only spent 1 token. In a high-traffic system (like a ticket sale), this allows users to sneak past limits.

The problem here is that our read-calculate-write cycle is not atomic.

We can solve this problem using a Lua script. The idea is to offload the read-calculate-write cycle from the application server code to a Lua script that runs inside Redis.

Redis is single-threaded: when a Lua script runs, it blocks everything else. By doing the reads and writes with redis.call inside Lua, the read-calculate-write cycle is safe even if 1,000 requests hit at the exact same millisecond.

So now, even if Alice and Bob request at the same time, their scripts run one after the other, and the second one reads the updated token count.
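
To make this concrete, here is a minimal sketch of the whole cycle as one atomic Lua script. The key layout and argument order are assumptions for illustration, not the exact script from the linked implementation:

-- KEYS[1] = bucket key, e.g. limiter:login:127.0.0.1
-- ARGV[1] = capacity, ARGV[2] = refill rate (tokens/ms), ARGV[3] = now (ms)
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call("HMGET", KEYS[1], "tokens", "last_refill")
local tokens = tonumber(bucket[1]) or capacity -- new users start with a full bucket
local last_refill = tonumber(bucket[2]) or now

-- Lazily refill: add the tokens that "flowed in" since the last request
local delta = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + delta * rate)

-- Spend a token if we have one
local allowed = 0
if tokens >= 1 then
    allowed = 1
    tokens = tokens - 1
end

redis.call("HSET", KEYS[1], "tokens", tokens, "last_refill", now)
return allowed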

Leaky bucket

For this algorithm, assume that we have a bucket with a hole at the bottom. When you make a request, you add water to the bucket, and the water leaks out of the hole at a constant rate. If adding the request's water would exceed the bucket's capacity, the bucket overflows and the request is dropped.

Playground

Hit the request button, get the bucket to fill, and drop some requests.


Implementation

Just like the Token Bucket, we do not use a background timer to constantly "drip" water out of the bucket every millisecond. That would be inefficient at scale. Instead, we calculate the leak lazily, only when a new request arrives.

We assume that between the last request and now, the bucket has been leaking correctly. We simply calculate how much should have leaked in that time gap and update the state instantly.

Here also we use the Redis hash data structure for each user / identifier to store 2 values:

  • Water level
  • Last updated time

And based on our requirements we tune 2 values: the bucket size and the leak rate.

For each request:

  • We get the water_level and last_updated values:

    local state = redis.call("HMGET", key, "water_level", "last_updated")
  • Calculate how much leaked since last time

    local elapsed = math.max(0, now - last_updated)
    local leaked_amount = elapsed * rate
  • Update water level (cannot be negative)

    water_level = math.max(0, water_level - leaked_amount)

    Now we add the request's water to the updated level; if it stays within the bucket's capacity, the request is accepted, else it is rejected.

You can look at the complete implementation here.
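
Assembled into a single atomic Lua script, the flow might look like this minimal sketch (the key layout and argument order are again assumptions for illustration):

-- KEYS[1] = bucket key; ARGV[1] = capacity, ARGV[2] = leak rate (units/ms), ARGV[3] = now (ms)
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local state = redis.call("HMGET", KEYS[1], "water_level", "last_updated")
local water_level = tonumber(state[1]) or 0
local last_updated = tonumber(state[2]) or now

-- Lazily leak: drain whatever should have dripped out since the last request
local elapsed = math.max(0, now - last_updated)
water_level = math.max(0, water_level - elapsed * rate)

-- Each request adds one unit of water; reject if it would overflow
local allowed = 0
if water_level + 1 <= capacity then
    allowed = 1
    water_level = water_level + 1
end

redis.call("HSET", KEYS[1], "water_level", water_level, "last_updated", now)
return allowed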

Pros & Cons

Pros:

  • Smooth, predictable output: processes requests at a constant rate, preventing sudden traffic spikes from hitting your backend. This is ideal when downstream services can't handle bursts (e.g., third-party APIs with strict rate limits).

  • Fair processing: requests are handled in order (FIFO), which can be important for things like payment processing or job queues.

  • Memory efficient: like Token Bucket, just 2 values per user.


Cons:

  • No burst tolerance: even if the bucket is empty and your system is idle, users still can't send more than the leak rate allows. This feels restrictive compared to Token Bucket.

  • Can cause artificial delays: legitimate users might experience slow response times because the algorithm enforces a constant drip rate, even when the system has capacity.

  • Harder to tune: you need to match the leak rate to your actual system capacity, which requires deeper understanding of your backend throughput.

For the next set of algorithms we need some fresh air! Let's move from buckets to windows. 💨

Fixed Window

For the fixed window algorithm, we define a window duration, say 5s, and the number of requests allowed in each window, say 5.

So we have a rule like: we can make 5 requests in each 5s window.

If requests go past the limit in the current window, they are dropped. A new window starts after the window duration is over.

Playground

Play around. Try to hit the window limits.


Implementation

This is one of the simplest algorithms to implement.

Here we just use a key-value pair to count the number of accepted requests in the current window.

If this is the first request, we create the window and set a TTL for the window duration.

-- 1. Increment the counter
local current_count = redis.call("INCR", key)
-- 2. If this is the first request in this window, set the expiry
if current_count == 1 then
    redis.call("PEXPIRE", key, window_size)
end

Then do the math on the counter: if the number of requests in the window (including this one) is within the limit, the request is allowed; else it is rejected.
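
In Lua, that check might look like this sketch (assuming limit is passed in as an argument; because INCR already counted the current request, we compare with <=):

-- current_count already includes this request, so allow
-- as long as it hasn't pushed us past the limit
local allowed = 0
if current_count <= limit then
    allowed = 1
end
return allowed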

After the window duration, the key is deleted (thanks to the TTL). The next request creates a new key, i.e., a new window.

Complete implementation.

Pros & Cons

Pros:

  • Extremely simple: just increment a counter and check against a limit. No complex math or state management.

  • Very memory efficient: only 1 integer per user per window.

  • Fast: O(1) operations (increment + compare). Can handle millions of requests per second.

  • Easy to explain to users: "You can make 100 requests per hour" is straightforward.


Cons:

  • The boundary problem: users can send 2x the intended limit by clustering requests at window edges (e.g., 50 requests at 11:59:59 and 50 more at 12:00:01 = 100 requests in 2 seconds, even though the limit is "50 per minute").

  • Unfair to users who spread requests: someone making steady requests throughout the hour might get blocked, while someone who waits until the last second can burst through.

  • Resets can feel arbitrary: if you hit your limit at 10:05, you have to wait until 11:00 for a reset (even though you've been blocked for 55 minutes).

We just saw that the Fixed Window algorithm is simple but has a "double burst" flaw (allowing 2x traffic at the window boundary). To solve this with perfect precision, we use the Sliding Window Log algorithm.

Sliding window log

Instead of just counting "5 requests", we store the timestamps of all the accepted requests in a log.

When a new request comes in, we look back exactly one window width (e.g., 60 seconds) and count how many logs fall in that range.

If the number of requests in the log is less than the capacity, the request is accepted, else it is rejected.

Playground

Here's the playground. Hit some requests and try to understand why replicating the "double burst" issue here is not possible.



Implementation

The implementation relies entirely on the Redis Sorted Set (ZSET).

A Sorted Set is like a standard Set (unique items), but every item has a Score attached to it. Redis sorts the items based on this score.

Here we store 2 pieces of information:

  • The Member (Value): The unique ID of the request.
  • The Score: The timestamp (in milliseconds).

By using the Timestamp as the Score, we can ask Redis questions like "Remove everything with a score less than X".

We need Lua to perform the Clean -> Count -> Add sequence atomically.

The flow for each request:

  • Calculate the start of the window

    local clear_before = now - window_size
  • Remove all elements older than the window start

    -- This is the "Sliding" part: dropping the tail of the window
    redis.call("ZREMRANGEBYSCORE", key, "-inf", clear_before)
  • Count how many requests are currently in the window (only the accepted requests are in the set)

    local current_count = redis.call("ZCARD", key)
  • Check if allowed (compare the current_count with the limit):

    local allowed = 0
    if current_count < limit then
        allowed = 1
        -- Add the current request to the set
        -- Score = now, Member = unique_id
        redis.call("ZADD", key, now, unique_id)
        -- Set TTL to clean up if the user stops sending requests
        redis.call("PEXPIRE", key, window_size + 1000)
        current_count = current_count + 1
    end
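  • Finally, return the decision to the application (the exact return shape here is an assumption; return whatever your caller needs)

    -- 1 = allowed; also report how many requests are now in the window
    return { allowed, current_count }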

Complete implementation.

Pros & Cons

Pros:

  • Perfectly accurate: no approximations or edge cases. If you say "100 requests per minute," users get exactly 100 requests in any rolling 60-second window.

  • No boundary exploitation: unlike Fixed Window, there's no way to game the system by clustering requests at window edges.

  • Fair distribution: every user's limit is continuously enforced based on actual recent activity, not arbitrary reset times.


Cons:

  • Memory intensive: for high limits (e.g., 5000 req/min), you're storing 5000 timestamps per user. With 10k concurrent users, that's 50 million entries in Redis.

  • Computationally expensive: every request requires cleaning old entries (ZREMRANGEBYSCORE) and counting (ZCARD), which involves O(log N) operations (where N = number of requests). This is heavier than simple counters.

  • Vulnerable to attacks: malicious actors can exhaust your Redis memory by creating millions of rate limit keys, each storing thousands of timestamps.

Sliding window counter

We've seen that the Sliding Window Log is precise but memory-heavy (O(N)), while the Fixed Window is efficient (O(1)) but inaccurate near window boundaries.

To get the best of both worlds, we use the Sliding Window Counter (often called the Hybrid or Weighted approach).

Instead of storing a timestamp for every request, we simply count requests in fixed time buckets (just like Fixed Window).

However, to solve the "boundary" issue, we look at two windows: the Current Window and the Previous Window.

We estimate the current load using a weighted formula:

Weighted Count = Current Count + Previous Count × (Time Remaining in Previous Window / Window Size)
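
To see the formula in action, take some made-up numbers: the window is 60s with a limit of 100, the previous window had 60 accepted requests, and we are 15s into the current window with 20 requests so far. The previous window still overlaps the sliding window for 45 of its 60 seconds, so Weighted Count = 20 + 60 × (45/60) = 20 + 45 = 65. That is under the limit, and the request is allowed.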

Since we just store the counter and not the timestamps of the requests, we assume that all the requests in the previous window were evenly distributed.

This way we get a pretty good estimate while storing just one counter per window. This assumption can lead to inaccurate results when the requests in the previous window are skewed toward one end.

But it averages out and works most of the time.

Playground

Make some requests, see the weighted count update in real time, and try to make sense of it.

Note: we only consider the accepted requests from the previous window.

Implementation

Here we just store 2 keys per user: one counter for the current window and one for the previous window.

For each request, we first fetch both counters (a missing key counts as 0), then:

  • Calculate how far into the current window we are (in ms):

    local elapsed_in_window = now % window_size
  • Calculate the weighted count

    local overlap_factor = (window_size - elapsed_in_window) / window_size
    local weighted_count = current_count + (previous_count * overlap_factor)
  • Check whether the request is allowed and update the values:

    local allowed = 0
    if weighted_count < limit then
        allowed = 1
        -- Increment the current window counter
        redis.call("INCR", current_key)
        -- Set Expiry
        -- We need the data to last for at least 2 windows (current + when it becomes previous)
        -- Adding a small buffer (+1000ms) to be safe.
        redis.call("PEXPIRE", current_key, (window_size * 2) + 1000)
        -- Update the return estimate to reflect this new request
        weighted_count = weighted_count + 1
    end

    Complete implementation.
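
Putting the steps together, here is a minimal sketch of the whole script. The key naming scheme and argument order are my own assumptions; the linked implementation may differ:

-- KEYS[1] = current window key, KEYS[2] = previous window key
--   (e.g. limiter:<user>:<window_id>, where window_id = floor(now / window_size))
-- ARGV[1] = limit, ARGV[2] = window size (ms), ARGV[3] = now (ms)
local limit = tonumber(ARGV[1])
local window_size = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Missing keys simply count as 0
local current_count = tonumber(redis.call("GET", KEYS[1])) or 0
local previous_count = tonumber(redis.call("GET", KEYS[2])) or 0

-- Weight the previous window by how much of it still overlaps the sliding window
local elapsed_in_window = now % window_size
local overlap_factor = (window_size - elapsed_in_window) / window_size
local weighted_count = current_count + (previous_count * overlap_factor)

local allowed = 0
if weighted_count < limit then
    allowed = 1
    redis.call("INCR", KEYS[1])
    -- Keep the counter around long enough to serve as "previous" in the next window
    redis.call("PEXPIRE", KEYS[1], (window_size * 2) + 1000)
end
return allowed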

Pros & Cons

Pros:

  • Best of both worlds: combines Fixed Window's efficiency (O(1) operations, minimal memory) with nearly the accuracy of Sliding Window Log (99.997% correct according to Cloudflare).

  • Industry proven: battle-tested by companies like Cloudflare, GitHub, and Stripe handling billions of requests daily.

  • Memory efficient: just 2 counters per user (current + previous window), regardless of traffic volume.

  • No boundary exploitation: the weighted formula smooths out traffic at window edges, preventing the "double burst" problem.

  • Scales beautifully: whether your limit is 10 or 10 million requests, the cost remains constant.


Cons:

  • Slight inaccuracy: assumes requests in the previous window were evenly distributed. If someone sent all 1000 requests in the last second of the previous window, the algorithm might over-allow in the current window.

  • Edge case errors: according to Cloudflare's experiments, 0.003% of requests (roughly 3 in 100,000) are incorrectly allowed or blocked. This is negligible for most use cases but could matter in high-stakes scenarios like financial transactions.

  • Less intuitive to debug: when users ask "why was I rate limited?", explaining weighted counts is harder than pointing to a simple counter.

Choosing the Right Algorithm

Now that we've explored all five algorithms, let's break down when to use each one:

Token Bucket

Use when: You need to handle natural traffic bursts (file uploads, batch operations, user login after idle periods).

Best for: APIs with bursty workloads where occasional spikes are expected and safe.

Example: A photo upload API where users might upload 10 images at once after a vacation.

Avoid when: Bursts could overwhelm your backend (e.g., database writes, expensive computations).


Leaky Bucket

Use when: You need strict, predictable output rates to protect fragile downstream services.

Best for: Proxying third-party APIs with strict rate limits, job queues, or systems sensitive to traffic spikes.

Example: A service calling a payment gateway that explicitly requires "no more than 10 requests per second."

Avoid when: User experience matters more than backend protection; the forced delay can feel sluggish.


Fixed Window

Use when: Simplicity and speed are paramount, and slight inaccuracies are acceptable.

Best for: Internal services, low-stakes APIs, or scenarios where you just need "good enough" protection quickly.

Example: A dashboard limiting internal admin actions to 1000 per hour.

Avoid when: Users might exploit boundary issues, or you need fairness guarantees.


Sliding Window Log

Use when: You need absolute precision and have the resources to handle the memory/compute cost.

Best for: Financial APIs, security-critical systems, compliance scenarios, or low-limit high-value operations.

Example: An OTP verification endpoint allowing exactly 3 attempts per phone number per 5 minutes.

Avoid when: Handling high-throughput traffic (1000+ req/sec per user) or facing potential DDoS attacks.


Sliding Window Counter ⭐

Use when: You need production-grade rate limiting for high-traffic systems (the default choice for most engineers).

Best for: Public APIs, SaaS platforms, user-facing services, or anywhere you need to balance accuracy, performance, and scalability.

Example: A REST API serving thousands of users with limits like "1000 requests per hour."

Avoid when: You need absolute precision (use Sliding Window Log) or are building something quick-and-dirty (use Fixed Window).


Final Thoughts

If you're still unsure, start with Sliding Window Counter. It's what the industry has converged on for good reason: it handles 99% of use cases with minimal trade-offs.

That said, no algorithm is perfect. Here is a quick decision tree I follow:

Do you need perfect accuracy?
├─ YES → Sliding Window Log (but watch your memory!)
└─ NO → Continue...

Is your backend fragile or rate-limited itself?
├─ YES → Leaky Bucket (smooth output rate)
└─ NO → Continue...

Do you expect natural traffic bursts?
├─ YES → Token Bucket (allows accumulated credits)
└─ NO → Continue...

Is this a quick prototype or internal tool?
├─ YES → Fixed Window (simplest implementation)
└─ NO → Sliding Window Counter (production default)

That's all for this blog. Hope you enjoyed it! 👋