Distributed Rate Limiting: Algorithms & Overshoot
Rate limiting sounds simple: count requests, block the ones that exceed your threshold. In a single process it takes ten lines of code. In a distributed system, where your API is served by dozens of pods each with its own in-memory state, it becomes a subtle coordination problem, and getting it wrong means your limit of 100 req/s becomes 800 req/s under a burst.
This post covers the four standard algorithms, the exact mechanics of overshoot, and how to fix it.
The Four Algorithms
1. Fixed Window Counter
Divide time into windows of fixed size (e.g., each second). Keep a counter per window. Increment on every request; reject when the counter exceeds the limit.
def is_allowed(user_id: str, limit: int) -> bool:
window = int(time.time()) # seconds since epoch
key = f"rl:{user_id}:{window}"
count = redis.incr(key)
if count == 1:
redis.expire(key, 2) # clean up after window passes
return count <= limit
Pros: O(1) time and space per request. Dead simple. Cons: Vulnerable to boundary attacks (see the demo below).
2. Sliding Window Log
Store a timestamp for every request in a sorted set. On each new request, remove timestamps older than window_size, then check the set's size.
def is_allowed(user_id: str, limit: int, window_ms: int) -> bool:
now = time.time_ns() // 1_000_000 # ms
key = f"rl:{user_id}"
pipe = redis.pipeline()
pipe.zremrangebyscore(key, 0, now - window_ms)
pipe.zadd(key, {str(now): now})
pipe.zcard(key)
pipe.expire(key, window_ms // 1000 + 1)
_, _, count, _ = pipe.execute()
return count <= limit
Pros: Perfectly accurate, no boundary artifacts. Cons: Memory grows with request volume; each admitted request costs a sorted-set entry.
3. Sliding Window Counter (hybrid)
A practical middle ground: keep counters for the current and previous window, then interpolate.
rate = prev_count × (1 - elapsed/window) + curr_count
This approximates the sliding log with O(1) space. Redis's built-in sliding window rate limiter uses this approach.
4. Token Bucket
A bucket holds up to capacity tokens and refills at rate tokens/second. Each request consumes one token. Requests are rejected when the bucket is empty.
type TokenBucket struct {
capacity float64
tokens float64
rate float64 // tokens/second
lastCheck time.Time
}
func (b *TokenBucket) Allow() bool {
now := time.Now()
elapsed := now.Sub(b.lastCheck).Seconds()
b.lastCheck = now
b.tokens = min(b.capacity, b.tokens+elapsed*b.rate)
if b.tokens < 1 {
return false
}
b.tokens--
return true
}
Pros: Allows bursts up to capacity while enforcing a steady-state rate. Good for APIs where you want to absorb short spikes.
Cons: Two parameters to tune (capacity and rate), and distributed token buckets require careful state sharing.
5. Leaky Bucket
Requests enter a FIFO queue (the "bucket") and are processed at a fixed drain rate. Overflow is dropped. Unlike token bucket, the output rate is strictly constant; there is no bursting.
incoming ──► [queue, max_size=capacity] ──► processed at drain_rate req/s
↓ overflow dropped
Pros: Completely smooths traffic; ideal when downstream can't handle any bursting. Cons: Adds latency (queuing). A sudden burst fills the queue, delaying legitimate traffic even after the burst ends.
The Fixed Window Boundary Problem
Here is the canonical attack. The rate limiter allows limit requests per 1-second window. The attacker sends limit requests at the end of window 1, then limit more at the start of window 2. The counter resets between the two windows, so both batches are fully admitted, but they land within a rolling 1-second span, doubling your effective limit.
Fixed Window: Boundary Attack Demo
Drag the slider right → bursts compress in time → same 2× overshoot, arbitrarily high instantaneous rate
Notice that moving the slider further right compresses the two bursts in time. The overshoot is always 2× the limit; what changes is how quickly those requests arrive. At 99%, two full-limit bursts land within ~10 ms, an effectively instantaneous spike of 2 × limit requests hitting your backend.
The sliding window algorithms exist specifically to eliminate this artifact. With a true sliding window, the count always reflects the last window_ms of traffic regardless of window alignment, so there is no safe boundary to exploit.
Going Distributed
Single-node rate limiting is solved. The hard part is coordination across multiple nodes.
Architecture
client ──► load balancer ──► node-1 ──► backend
├─► node-2 ──► backend
└─► node-3 ──► backend
│
shared Redis counter
Every node reads and writes a shared counter in Redis. In theory this gives you a single authoritative count. In practice, several things go wrong.
Problem 1: Stale Local Cache
To reduce latency, teams cache the Redis counter locally for a few hundred milliseconds and only sync periodically. During that sync gap, each node makes admission decisions based on stale data.
If three nodes all cache a count of 0 and a burst arrives before the next sync, each independently admits up to the full limit, multiplying your actual throughput by the number of nodes.
Distributed Rate Limiting: Sync Lag Overshoot
3 nodes × ~400ms stale cache → each admits independently → 1.8× intended limit
Try setting sync interval to 0: that represents an atomic Redis call on every request (no local cache), which eliminates overshoot entirely. Now drag it up. At 400ms and 3 nodes you're already admitting 1.8× the limit. Push the interval to a full 1000ms and it reaches the theoretical maximum of nodeCount × limit, which is 3× here.
Problem 2: Check-Then-Act Races
Even without caching, a non-atomic read-increment-check sequence has a race window:
node-1: GET counter → 99 # both nodes see 99
node-2: GET counter → 99 # both think one slot remains
node-1: INCR counter → 100 # admitted ✓
node-2: INCR counter → 101 # also admitted ✗ (overshoot by 1)
With N nodes under high concurrency, you can overshoot by up to N–1 per burst. This is the classic TOCTOU (time-of-check to time-of-use) problem.
Problem 3: Clock Skew
Fixed-window counters reset based on wall-clock time. If node-1 and node-2 disagree on the current second by even a few hundred milliseconds, they reset their windows at different times, creating a distributed version of the boundary problem on top of the single-node one.
Solutions
Atomic Redis INCR (the simplest fix)
Replace GET + INCR with a single atomic INCR. If the returned value is 1, set the TTL. This is safe because Redis is single-threaded: no two INCR calls interleave.
def is_allowed_atomic(user_id: str, limit: int) -> bool:
key = f"rl:{user_id}:{int(time.time())}"
# INCR is atomic, no race between read and increment
count = redis.incr(key)
if count == 1:
redis.expire(key, 2)
return count <= limit
This eliminates the check-then-act race entirely. It still has the fixed-window boundary problem, but that's a separate concern.
Lua Script for Multi-Step Atomicity
For sliding window counter (two keys, interpolation math), use a Lua script. Redis executes scripts atomically:
-- KEYS[1] = current window key
-- KEYS[2] = previous window key
-- ARGV[1] = limit, ARGV[2] = window_ms, ARGV[3] = now_ms
local now = tonumber(ARGV[3])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[1])
local curr_start = now - (now % window)
local elapsed = (now - curr_start) / window -- 0.0 → 1.0
local prev = tonumber(redis.call('GET', KEYS[2])) or 0
local curr = tonumber(redis.call('GET', KEYS[1])) or 0
-- weighted estimate of requests in the last `window` ms
local rate = prev * (1 - elapsed) + curr
if rate >= limit then
return 0
end
redis.call('INCR', KEYS[1])
redis.call('PEXPIRE', KEYS[1], window * 2)
return 1
The script reads both counters and increments atomically, so no node can observe an inconsistent intermediate state.
Sliding Window with Sorted Set (exact, higher cost)
def is_allowed_exact(user_id: str, limit: int, window_ms: int) -> bool:
now = int(time.time() * 1000)
key = f"rl:sw:{user_id}"
with redis.pipeline() as pipe:
pipe.multi()
# remove entries outside the window
pipe.zremrangebyscore(key, 0, now - window_ms)
# add this request (score = timestamp for ordering)
pipe.zadd(key, {f"{now}-{os.urandom(4).hex()}": now})
# count requests in window
pipe.zcard(key)
pipe.expire(key, window_ms // 1000 + 1)
results = pipe.execute()
count = results[2]
return count <= limit
This gives exact sliding-window semantics with no boundary artifacts. The trade-off is memory: each request writes one entry. For high-volume APIs, consider the sliding window counter (interpolation approach) instead.
Token Bucket in Redis
Store (tokens, last_refill_timestamp) as a Redis hash. Use a Lua script to atomically refill and consume:
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2]) -- tokens/ms
local now = tonumber(ARGV[3]) -- ms
local data = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(data[1]) or capacity
local last_ts = tonumber(data[2]) or now
-- refill based on elapsed time
local elapsed = now - last_ts
tokens = math.min(capacity, tokens + elapsed * rate)
if tokens < 1 then
return 0 -- rejected
end
tokens = tokens - 1
redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
redis.call('PEXPIRE', key, math.ceil(capacity / rate) + 1000)
return 1 -- admitted
This is what AWS API Gateway and many service meshes use internally. The Lua atomicity guarantee means no race, and the token model gives you burst absorption for free.
Choosing the Right Algorithm
| Algorithm | Overshoot risk | Memory | Burst tolerance | Best for |
|---|---|---|---|---|
| Fixed window | High (2×) | O(1) | None | Internal services, simple budgets |
| Sliding window log | None | O(requests) | None | Strict per-user quotas |
| Sliding window counter | Tiny (~1%) | O(1) | None | High-traffic public APIs |
| Token bucket | None (atomic) | O(1) | Yes (up to capacity) | APIs with bursty-but-fair usage |
| Leaky bucket | None | O(queue) | No (queue smooths) | Upstream traffic shaping |
For most distributed APIs: sliding window counter via Lua or token bucket via Lua are the sweet spot: O(1) space, atomic, no boundary issues.
What to Say in a System Design Interview
When rate limiting comes up in an interview, walk through it in this order:
-
Clarify scope: per user? per IP? per API key? global? Different answers lead to different key structures in Redis.
-
Name the algorithm and justify it: "I'd use a token bucket because our users have bursty-but-infrequent traffic patterns and I want to absorb short spikes." Don't just say "rate limiter."
-
Address the distributed problem explicitly. This is what separates strong candidates. Say: "In a multi-node deployment, I'd store the counter in Redis and use an atomic Lua script to avoid check-then-act races. I'd avoid local caching of the counter unless latency benchmarks show Redis is a bottleneck."
-
Discuss failure modes: what happens if Redis is unavailable? Two options: fail open (admit all traffic, risk overload) or fail closed (reject all traffic, risk availability). Most production systems fail open with alerting.
-
Mention the key structure:
rl:{user_id}:{window}for fixed window,rl:{user_id}for sliding/token bucket. Interviewers appreciate operational thinking.
The deeper you understand why naive distributed implementations overshoot (stale caches, TOCTOU races, clock skew), the more convincingly you can argue for the atomic Redis approach rather than just pattern-matching to "use Redis."