LeetDesign
·10 min read

Distributed Rate Limiting: Algorithms & Overshoot

rate-limitingdistributed-systemssystem-design

Rate limiting sounds simple: count requests, block the ones that exceed your threshold. In a single process it takes ten lines of code. In a distributed system, where your API is served by dozens of pods each with its own in-memory state, it becomes a subtle coordination problem, and getting it wrong means your limit of 100 req/s becomes 800 req/s under a burst.

This post covers the four standard algorithms, the exact mechanics of overshoot, and how to fix it.


The Four Algorithms

1. Fixed Window Counter

Divide time into windows of fixed size (e.g., each second). Keep a counter per window. Increment on every request; reject when the counter exceeds the limit.

def is_allowed(user_id: str, limit: int) -> bool:
    window = int(time.time())          # seconds since epoch
    key = f"rl:{user_id}:{window}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, 2)           # clean up after window passes
    return count <= limit

Pros: O(1) time and space per request. Dead simple. Cons: Vulnerable to boundary attacks (see the demo below).

2. Sliding Window Log

Store a timestamp for every request in a sorted set. On each new request, remove timestamps older than window_size, then check the set's size.

def is_allowed(user_id: str, limit: int, window_ms: int) -> bool:
    now = time.time_ns() // 1_000_000   # ms
    key = f"rl:{user_id}"
    pipe = redis.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_ms)
    pipe.zadd(key, {str(now): now})
    pipe.zcard(key)
    pipe.expire(key, window_ms // 1000 + 1)
    _, _, count, _ = pipe.execute()
    return count <= limit

Pros: Perfectly accurate, no boundary artifacts. Cons: Memory grows with request volume; each admitted request costs a sorted-set entry.

3. Sliding Window Counter (hybrid)

A practical middle ground: keep counters for the current and previous window, then interpolate.

rate = prev_count × (1 - elapsed/window) + curr_count

This approximates the sliding log with O(1) space. Redis's built-in sliding window rate limiter uses this approach.

4. Token Bucket

A bucket holds up to capacity tokens and refills at rate tokens/second. Each request consumes one token. Requests are rejected when the bucket is empty.

type TokenBucket struct {
    capacity  float64
    tokens    float64
    rate      float64   // tokens/second
    lastCheck time.Time
}

func (b *TokenBucket) Allow() bool {
    now := time.Now()
    elapsed := now.Sub(b.lastCheck).Seconds()
    b.lastCheck = now
    b.tokens = min(b.capacity, b.tokens+elapsed*b.rate)
    if b.tokens < 1 {
        return false
    }
    b.tokens--
    return true
}

Pros: Allows bursts up to capacity while enforcing a steady-state rate. Good for APIs where you want to absorb short spikes. Cons: Two parameters to tune (capacity and rate), and distributed token buckets require careful state sharing.

5. Leaky Bucket

Requests enter a FIFO queue (the "bucket") and are processed at a fixed drain rate. Overflow is dropped. Unlike token bucket, the output rate is strictly constant; there is no bursting.

incoming ──► [queue, max_size=capacity] ──► processed at drain_rate req/s
                  ↓ overflow dropped

Pros: Completely smooths traffic; ideal when downstream can't handle any bursting. Cons: Adds latency (queuing). A sudden burst fills the queue, delaying legitimate traffic even after the burst ends.


The Fixed Window Boundary Problem

Here is the canonical attack. The rate limiter allows limit requests per 1-second window. The attacker sends limit requests at the end of window 1, then limit more at the start of window 2. The counter resets between the two windows, so both batches are fully admitted, but they land within a rolling 1-second span, doubling your effective limit.

Fixed Window: Boundary Attack Demo

Limit (req / window):20
Attack position in W1:85%
Window 1 (limit=20)Window 2 (limit=20)1-second rolling window (sees 40 requests)20 req20 req01000ms2000ms
Rate limiter sees
20 + 20
each window OK
Admitted in 1 sec
40
intended: 20
Gap between bursts
150ms
~267 req/s burst

Drag the slider right → bursts compress in time → same 2× overshoot, arbitrarily high instantaneous rate

Notice that moving the slider further right compresses the two bursts in time. The overshoot is always 2× the limit; what changes is how quickly those requests arrive. At 99%, two full-limit bursts land within ~10 ms, an effectively instantaneous spike of 2 × limit requests hitting your backend.

The sliding window algorithms exist specifically to eliminate this artifact. With a true sliding window, the count always reflects the last window_ms of traffic regardless of window alignment, so there is no safe boundary to exploit.


Going Distributed

Single-node rate limiting is solved. The hard part is coordination across multiple nodes.

Architecture

client ──► load balancer ──► node-1 ──► backend
                         ├─► node-2 ──► backend
                         └─► node-3 ──► backend
                                │
                         shared Redis counter

Every node reads and writes a shared counter in Redis. In theory this gives you a single authoritative count. In practice, several things go wrong.

Problem 1: Stale Local Cache

To reduce latency, teams cache the Redis counter locally for a few hundred milliseconds and only sync periodically. During that sync gap, each node makes admission decisions based on stale data.

If three nodes all cache a count of 0 and a burst arrives before the next sync, each independently admits up to the full limit, multiplying your actual throughput by the number of nodes.

Distributed Rate Limiting: Sync Lag Overshoot

Number of nodes:3
Global limit (req/s):60
Counter sync interval:400ms
node-1
36
admitted
node-2
36
admitted
node-3
36
admitted
Total admitted vs limit108 / 60
60
+48
Overshoot:+80%

3 nodes × ~400ms stale cache → each admits independently → 1.8× intended limit

Try setting sync interval to 0: that represents an atomic Redis call on every request (no local cache), which eliminates overshoot entirely. Now drag it up. At 400ms and 3 nodes you're already admitting 1.8× the limit. Push the interval to a full 1000ms and it reaches the theoretical maximum of nodeCount × limit, which is 3× here.

Problem 2: Check-Then-Act Races

Even without caching, a non-atomic read-increment-check sequence has a race window:

node-1: GET counter  → 99   # both nodes see 99
node-2: GET counter  → 99   # both think one slot remains
node-1: INCR counter → 100  # admitted ✓
node-2: INCR counter → 101  # also admitted ✗ (overshoot by 1)

With N nodes under high concurrency, you can overshoot by up to N–1 per burst. This is the classic TOCTOU (time-of-check to time-of-use) problem.

Problem 3: Clock Skew

Fixed-window counters reset based on wall-clock time. If node-1 and node-2 disagree on the current second by even a few hundred milliseconds, they reset their windows at different times, creating a distributed version of the boundary problem on top of the single-node one.


Solutions

Atomic Redis INCR (the simplest fix)

Replace GET + INCR with a single atomic INCR. If the returned value is 1, set the TTL. This is safe because Redis is single-threaded: no two INCR calls interleave.

def is_allowed_atomic(user_id: str, limit: int) -> bool:
    key = f"rl:{user_id}:{int(time.time())}"
    # INCR is atomic, no race between read and increment
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, 2)
    return count <= limit

This eliminates the check-then-act race entirely. It still has the fixed-window boundary problem, but that's a separate concern.

Lua Script for Multi-Step Atomicity

For sliding window counter (two keys, interpolation math), use a Lua script. Redis executes scripts atomically:

-- KEYS[1] = current window key
-- KEYS[2] = previous window key
-- ARGV[1] = limit, ARGV[2] = window_ms, ARGV[3] = now_ms
local now        = tonumber(ARGV[3])
local window     = tonumber(ARGV[2])
local limit      = tonumber(ARGV[1])
local curr_start = now - (now % window)
local elapsed    = (now - curr_start) / window   -- 0.0 → 1.0

local prev = tonumber(redis.call('GET', KEYS[2])) or 0
local curr = tonumber(redis.call('GET', KEYS[1])) or 0

-- weighted estimate of requests in the last `window` ms
local rate = prev * (1 - elapsed) + curr

if rate >= limit then
    return 0
end

redis.call('INCR', KEYS[1])
redis.call('PEXPIRE', KEYS[1], window * 2)
return 1

The script reads both counters and increments atomically, so no node can observe an inconsistent intermediate state.

Sliding Window with Sorted Set (exact, higher cost)

def is_allowed_exact(user_id: str, limit: int, window_ms: int) -> bool:
    now = int(time.time() * 1000)
    key = f"rl:sw:{user_id}"

    with redis.pipeline() as pipe:
        pipe.multi()
        # remove entries outside the window
        pipe.zremrangebyscore(key, 0, now - window_ms)
        # add this request (score = timestamp for ordering)
        pipe.zadd(key, {f"{now}-{os.urandom(4).hex()}": now})
        # count requests in window
        pipe.zcard(key)
        pipe.expire(key, window_ms // 1000 + 1)
        results = pipe.execute()

    count = results[2]
    return count <= limit

This gives exact sliding-window semantics with no boundary artifacts. The trade-off is memory: each request writes one entry. For high-volume APIs, consider the sliding window counter (interpolation approach) instead.

Token Bucket in Redis

Store (tokens, last_refill_timestamp) as a Redis hash. Use a Lua script to atomically refill and consume:

local key       = KEYS[1]
local capacity  = tonumber(ARGV[1])
local rate      = tonumber(ARGV[2])   -- tokens/ms
local now       = tonumber(ARGV[3])   -- ms

local data      = redis.call('HMGET', key, 'tokens', 'ts')
local tokens    = tonumber(data[1]) or capacity
local last_ts   = tonumber(data[2]) or now

-- refill based on elapsed time
local elapsed   = now - last_ts
tokens          = math.min(capacity, tokens + elapsed * rate)

if tokens < 1 then
    return 0   -- rejected
end

tokens = tokens - 1
redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
redis.call('PEXPIRE', key, math.ceil(capacity / rate) + 1000)
return 1       -- admitted

This is what AWS API Gateway and many service meshes use internally. The Lua atomicity guarantee means no race, and the token model gives you burst absorption for free.


Choosing the Right Algorithm

AlgorithmOvershoot riskMemoryBurst toleranceBest for
Fixed windowHigh (2×)O(1)NoneInternal services, simple budgets
Sliding window logNoneO(requests)NoneStrict per-user quotas
Sliding window counterTiny (~1%)O(1)NoneHigh-traffic public APIs
Token bucketNone (atomic)O(1)Yes (up to capacity)APIs with bursty-but-fair usage
Leaky bucketNoneO(queue)No (queue smooths)Upstream traffic shaping

For most distributed APIs: sliding window counter via Lua or token bucket via Lua are the sweet spot: O(1) space, atomic, no boundary issues.


What to Say in a System Design Interview

When rate limiting comes up in an interview, walk through it in this order:

  1. Clarify scope: per user? per IP? per API key? global? Different answers lead to different key structures in Redis.

  2. Name the algorithm and justify it: "I'd use a token bucket because our users have bursty-but-infrequent traffic patterns and I want to absorb short spikes." Don't just say "rate limiter."

  3. Address the distributed problem explicitly. This is what separates strong candidates. Say: "In a multi-node deployment, I'd store the counter in Redis and use an atomic Lua script to avoid check-then-act races. I'd avoid local caching of the counter unless latency benchmarks show Redis is a bottleneck."

  4. Discuss failure modes: what happens if Redis is unavailable? Two options: fail open (admit all traffic, risk overload) or fail closed (reject all traffic, risk availability). Most production systems fail open with alerting.

  5. Mention the key structure: rl:{user_id}:{window} for fixed window, rl:{user_id} for sliding/token bucket. Interviewers appreciate operational thinking.

The deeper you understand why naive distributed implementations overshoot (stale caches, TOCTOU races, clock skew), the more convincingly you can argue for the atomic Redis approach rather than just pattern-matching to "use Redis."