Production Deployment: Local vs Distributed Token Buckets
The Core Decision
Where do you keep the token bucket? Two options: each server keeps its own bucket (local), or all servers share one bucket through a central store (distributed). This is not just an implementation detail. It fundamentally changes how rate limiting behaves, with implications for latency, accuracy, and failure modes.
Local Buckets
Think of multiple bouncers at different club entrances, each counting their own guests. If you have 8 API servers and want 10,000 requests/sec globally for a user, give each server 1,250/sec. Simple, fast (sub-microsecond decisions, typically 50 to 200 nanoseconds for an atomic counter check), no coordination needed. The catch: if 70 percent of traffic hits Server A, that server rejects requests while others have spare capacity. With 7,000 requests/sec to Server A (limit 1,250) and 3,000 spread across the other 7 servers, you reject 5,750 requests while 5,750 tokens sit unused.
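A per-server bucket is a few lines of code. The sketch below is illustrative (the class name and API are not from any particular library) and uses the standard continuous-refill formulation:

```python
import time

class LocalTokenBucket:
    """Per-server token bucket: no coordination, refills continuously."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # burst ceiling
        self.tokens = capacity    # start full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, clamped to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Each of 8 servers gets 10,000 / 8 = 1,250 tokens/sec.
bucket = LocalTokenBucket(rate=1250, capacity=1250)
```

The decision is a couple of arithmetic operations on local memory, which is where the nanosecond-scale latency comes from.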
Distributed Buckets
All servers share the exact same bucket state, usually in Redis using INCR and EXPIRE. Every request checks the global count and decrements it atomically. You get precise enforcement: a limit of 10,000/sec allows exactly 10,000/sec regardless of traffic distribution. The cost: every request adds a 0.2 to 2 ms network round trip. At 50,000 requests/sec, that is 50,000 Redis operations/sec. A single Redis instance handles roughly 100,000 to 200,000 ops/sec, so high-volume systems risk making the rate limiter itself the bottleneck.
Hybrid Token Leasing
Most production systems blend both. Servers grab leases of tokens from the global bucket: Server A requests 500 tokens from Redis, enforces them locally (zero network calls for the next 500 requests), then requests another batch. This reduces Redis ops from 50,000/sec to 100/sec (50,000 / 500 per lease). The tradeoff: you might overshoot the global limit by up to (num_servers times lease_size) during spikes. With 8 servers holding 500-token leases, you could exceed the limit by 4,000 requests before the next refresh. Tune lease size based on acceptable overshoot.
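The leasing loop can be sketched like this. The class names are hypothetical, and `GlobalBucket` is an in-memory stand-in for the shared store (in production, `take` would be an atomic Redis operation such as a DECRBY-based script):

```python
class GlobalBucket:
    """Stand-in for the shared store (assumption: Redis in production)."""

    def __init__(self, tokens: int):
        self.tokens = tokens

    def take(self, n: int) -> int:
        # Grant up to n tokens; may grant fewer near exhaustion.
        granted = min(n, self.tokens)
        self.tokens -= granted
        return granted

class LeasingLimiter:
    """Hybrid: lease token batches from the shared store, spend locally."""

    def __init__(self, global_bucket: GlobalBucket, lease_size: int):
        self.global_bucket = global_bucket
        self.lease_size = lease_size
        self.local_tokens = 0

    def allow(self) -> bool:
        if self.local_tokens == 0:
            # One coordination round trip per lease_size requests.
            self.local_tokens = self.global_bucket.take(self.lease_size)
        if self.local_tokens > 0:
            self.local_tokens -= 1
            return True
        return False
```

Each server only contacts the shared store once its local lease is exhausted, which is where the 500x reduction in coordination traffic comes from; the overshoot is bounded by the tokens sitting in unexpired leases.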
When to Use Which
Use local buckets when speed matters and approximate enforcement is acceptable (internal services, non-billable limits). Use distributed buckets when exact limits are critical (billing, paid API tiers, compliance). Use hybrid leasing for user-facing APIs at scale: most systems accept 5 to 10 percent overshoot in exchange for a 10x reduction in coordination overhead.