Token Bucket: Burst Control for Rate Limiting
Token bucket is an alternative to windowed counters that models rate limiting as a bucket with a maximum capacity that refills at a steady rate. Each request consumes one or more tokens; if tokens are available, the request proceeds and tokens are deducted. If the bucket is empty, the request is denied or delayed. The bucket refills continuously at the configured rate (for example, 100 tokens per second), and any excess tokens beyond capacity are discarded.
This approach naturally handles both steady state rate and burst tolerance. Set capacity to allow a reasonable burst (say, 500 tokens) and refill rate to your desired sustained throughput (100 tokens per second). A client can burst up to 500 requests instantly if the bucket is full, then sustain 100 requests per second indefinitely. This is far more user friendly than fixed windows, which either allow boundary gaming or harshly cut off bursts. Amazon API Gateway uses token bucket throttling with published steady state limits in the thousands of requests per second and burst capacities in the low thousands.
Implementation requires tracking two values per key: current token count and last refill timestamp. On each request, compute elapsed time since the last refill, add (elapsed × refill_rate) tokens capped at capacity, update the timestamp, then check and decrement tokens. Updates are O(1) and state is minimal (16 bytes per key for count and timestamp). Unlike sliding windows, there is no need for per-window counters or circular buffers.
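A minimal single-key sketch of that update in Python (the `TokenBucket` name and injectable `clock` parameter are illustrative choices, not from any particular library):

```python
import time


class TokenBucket:
    """Token bucket for one key: up to `capacity` burst tokens,
    refilled continuously at `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_rate = float(refill_rate)
        self.clock = clock              # injectable for deterministic testing
        self.tokens = float(capacity)   # start full so clients can burst immediately
        self.last_update = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.last_update
        # Refill proportionally to elapsed time, capped at capacity;
        # excess tokens beyond capacity are discarded by the min().
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_update = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The only per-key state is `tokens` and `last_update`, two 8-byte values, which is where the 16-bytes-per-key figure comes from.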
The tradeoff is that token bucket does not provide hard guarantees about request counts within any specific time window. A client might send 500 requests in one second (draining the burst), wait 5 seconds (the bucket refills 500 tokens at 100 per second), then burst another 500. That is 1,000 requests in 6 seconds, averaging about 166 per second even though the sustained rate is 100 per second. If you need strict window based accounting (for billing or quotas), token bucket is not suitable. For user facing API throttling where you want to allow reasonable bursts and smooth traffic, token bucket is often the best choice.
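That burst arithmetic can be checked with a small self-contained simulation (a sketch; the `simulate` helper and its event format are illustrative):

```python
def simulate(events, capacity=500.0, refill_rate=100.0):
    """Count accepted requests given (timestamp_seconds, request_count) events."""
    tokens, last, accepted = capacity, 0.0, 0
    for now, count in events:
        # Refill for the time elapsed since the previous event, capped at capacity.
        tokens = min(capacity, tokens + (now - last) * refill_rate)
        last = now
        for _ in range(count):
            if tokens >= 1:
                tokens -= 1
                accepted += 1
    return accepted

# 500 requests at t=0 drain the full bucket; by t=5s it has refilled 500 tokens,
# so a second burst of 500 is also fully accepted: 1,000 requests total.
print(simulate([(0.0, 500), (5.0, 500)]))  # → 1000
```

No 6-second window ever sees fewer than the sustained rate would suggest, which is exactly why token bucket cannot back strict per-window quotas.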
💡 Key Takeaways
•Combines steady rate and burst capacity in one mechanism. Amazon API Gateway typical limits are thousands of requests per second sustained with burst capacity in the low thousands, far more flexible than fixed windows.
•State per key is just 16 bytes (8 byte token count, 8 byte timestamp). With 10 million keys, only 160 MB versus gigabytes for sliding log. Update operations are O(1).
•Does not enforce strict per window quotas. A client with 500 token capacity and 100 per second refill can send 500 requests, wait 5 seconds, send 500 more (1,000 in 6 seconds, 166 per second average).
•Better user experience than hard window cutoffs. Clients can absorb transient spikes without hitting limits, reducing false positive throttling and retry storms.
•Distributed enforcement can use local buckets with periodic synchronization. Allow each node a fraction of global capacity, aggregate every 100 to 500 milliseconds, accepting bounded staleness for lower latency.
•Thundering herd risk if all clients synchronize refills. Add client side jitter to retry logic and stagger bucket refill schedules across keys to smooth load.
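The last two takeaways can be combined in a small sketch (hedged: the equal per-node split, jitter bounds, and function names are assumptions for illustration, not any system's defaults):

```python
import random

GLOBAL_CAPACITY = 2000.0   # fleet-wide burst tokens
GLOBAL_REFILL = 1000.0     # fleet-wide tokens per second


def local_bucket_params(num_nodes):
    """Give each node an equal fraction of the global budget.

    Nodes then enforce locally and reconcile against a shared store
    every few hundred milliseconds, accepting bounded staleness."""
    return GLOBAL_CAPACITY / num_nodes, GLOBAL_REFILL / num_nodes


def retry_delay(base_seconds, attempt, max_jitter=0.5):
    """Exponential backoff plus random jitter so throttled clients
    do not all retry (and refill) in lockstep."""
    return base_seconds * (2 ** attempt) + random.uniform(0, max_jitter)
```

Splitting capacity evenly is the simplest policy; skew-aware splits (weighting hot nodes) are possible but complicate the reconciliation step.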
📌 Examples
API Gateway configuration: capacity = 2,000 tokens, refill = 1,000 tokens per second. Client sends 2,000 requests at t=0 (bucket empty). At t=2s, bucket has 2,000 tokens again. Client can burst another 2,000 requests, sustaining 1,000 req/sec average.
Implementation in code: tokens = min(capacity, tokens + (now - last_update) * refill_rate). If tokens >= cost: tokens -= cost, allow request. Else: deny. Finally: last_update = now. All O(1) operations.
Reddit scale: 500,000 OAuth clients, 60 requests per minute sustained (1 per second), burst 10. State = 500,000 × 16 bytes = 8 MB for tokens and timestamps. Far smaller than sliding window counters at same scale.