
Token Bucket Algorithm: Core Mechanics and Burst Control

Token Bucket is a rate limiting algorithm defined by two parameters: refill rate r (tokens per second) and capacity b (maximum tokens). Tokens accumulate continuously during idle periods, up to capacity b. Each incoming request consumes one or more tokens. If enough tokens are available, the request passes immediately with sub-microsecond overhead; if the bucket is empty, the request is rejected with a 429 response or, optionally, queued.

The key property is bounded burst admission: over any time interval T, the algorithm admits at most r·T + b requests (more precisely, r·T plus the tokens on hand at the start of the interval, which never exceed b). A client arriving after an idle period can therefore send up to b requests immediately as a burst, then sustain r requests per second indefinitely. For example, with r = 1,000 tokens/sec and b = 2,000 tokens, a cold start allows a 2,000-request spike followed by a 1,000 requests/sec steady state.

The refill calculation uses elapsed time: on each check, compute tokens = min(b, current_tokens + r·Δt), where Δt is the time since the last refill. Always use monotonic clocks so tokens do not jump or stall during NTP adjustments. A local in-process token bucket adds roughly 200 nanoseconds per check and can sustain over 5 million checks per second per CPU core, making its overhead negligible for most applications.

This differs fundamentally from Leaky Bucket, which paces output at a nearly constant rate and implicitly queues requests. Token Bucket prioritizes low latency and user-perceived snappiness by allowing bursts, making it ideal for user-facing APIs where downstream systems can absorb short spikes without degradation.
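A minimal sketch of these mechanics in Go (the Bucket type and Allow method are illustrative names, not from any particular library). Go's time.Now readings carry a monotonic component, so the elapsed-time subtraction is safe across NTP adjustments:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Bucket is a minimal token bucket: refill rate r tokens/sec, capacity b.
type Bucket struct {
	mu     sync.Mutex
	r      float64   // refill rate, tokens per second
	b      float64   // capacity, maximum tokens
	tokens float64   // current token count
	last   time.Time // time of last refill
}

// NewBucket starts full so a cold client can burst up to b immediately.
func NewBucket(r, b float64) *Bucket {
	return &Bucket{r: r, b: b, tokens: b, last: time.Now()}
}

// Allow consumes one token if available. time.Time values include a
// monotonic reading, so the Δt math is immune to wall-clock steps.
func (bk *Bucket) Allow() bool {
	bk.mu.Lock()
	defer bk.mu.Unlock()

	now := time.Now()
	dt := now.Sub(bk.last).Seconds()
	bk.last = now

	// tokens = min(b, current + r·Δt)
	bk.tokens += bk.r * dt
	if bk.tokens > bk.b {
		bk.tokens = bk.b
	}

	if bk.tokens >= 1 {
		bk.tokens--
		return true // admit; caller proceeds
	}
	return false // reject; caller returns 429 or queues
}

func main() {
	bk := NewBucket(1000, 2000)  // r = 1,000/sec, b = 2,000
	fmt.Println(bk.Allow())      // true: a cold bucket admits the burst
}
```

Starting the bucket full mirrors the cold-start behavior described above: the first b requests pass immediately, after which admissions pace at r per second.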
💡 Key Takeaways
Two parameters define behavior: refill rate r (tokens/sec) and capacity b (max tokens). Typical production values are r = 1,000 to 10,000 tokens/sec, with b sized to hold 0.5 to 1.0 seconds' worth of refill (b ≈ 0.5r to 1.0r), balancing burst tolerance against downstream protection.
Continuous refill using elapsed time prevents synchronized bursts at second boundaries. Calculate tokens as min(b, current + r·Δt) on every request using monotonic clocks to avoid NTP induced jumps or stalls.
Weighted token consumption allows capacity-aware limiting. A heavy query can consume w = 5 tokens while a light request costs w = 1, naturally throttling expensive operations more aggressively than cheap ones (see the limiter sketch after this list).
Local in-process buckets add under 200 nanoseconds per check and sustain over 5 million decisions per second per core. AWS API Gateway uses this approach with negligible overhead, returning 429 rejections within single-digit milliseconds.
Over any interval T, the bucket admits at most r·T + b requests: r·T from refill plus the tokens on hand at the start, which never exceed b. With r = 8,000/sec and b = 2,000, a 10 second window allows at most 82,000 requests (80,000 steady + 2,000 burst).
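The weighted consumption described above maps directly onto Go's golang.org/x/time/rate limiter, whose AllowN method charges n tokens per call. A minimal sketch, reusing the r = 1,000/sec and b = 2,000 figures from the text:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// r = 1,000 tokens/sec, b = 2,000 tokens.
	lim := rate.NewLimiter(rate.Limit(1000), 2000)

	// A light request costs w = 1 token.
	fmt.Println(lim.AllowN(time.Now(), 1)) // true while tokens remain

	// A heavy query costs w = 5 tokens, so it drains the bucket
	// five times faster and is throttled more aggressively.
	fmt.Println(lim.AllowN(time.Now(), 5))
}
```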
📌 Examples
AWS API Gateway exposes rate and burst parameters that map directly to a token bucket. Typical account defaults are r = 10,000 requests/sec per Region with b in the several-thousand range, allowing cold clients to spike immediately and then sustain 10k RPS without 429 errors.
Kubernetes client-go uses a token bucket with QPS = 5 and burst = 10 per client instance. On a node with 50 controllers, the aggregate load naturally smooths to a few hundred RPS without centralized coordination, preventing API server overload (see the configuration sketch after these examples).
Envoy local rate limiting filters implement token buckets with r = 1,000 RPS, b = 200 for per-route limits. Decision overhead measures under 100 microseconds, and operators chain multiple buckets (per IP and per API key) to bound abuse while preserving legitimate bursts.
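A minimal sketch of the client-go tuning above: the QPS and Burst fields on rest.Config set the token bucket's rate and capacity (the kubeconfig path here is illustrative):

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Path is illustrative; in-cluster config works the same way.
	config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
	if err != nil {
		panic(err)
	}

	// Token bucket parameters: sustain 5 requests/sec, burst up to 10.
	config.QPS = 5
	config.Burst = 10

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	_ = clientset // every API call through clientset is gated by the bucket
}
```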