Rate Limiting › Token Bucket Algorithm · Medium · ⏱️ ~3 min

Production Deployment: Local vs Distributed Token Buckets

The Core Decision

Where do you keep the token bucket? Two options: each server keeps its own bucket (local), or all servers share one bucket through a central store (distributed). This is not just an implementation detail: it fundamentally changes how rate limiting behaves, with consequences for latency, accuracy, and failure modes.

Local Buckets

Think of multiple bouncers at different club entrances, each counting their own guests. If you have 8 API servers and want 10,000 requests/sec globally for a user, give each server 1,250/sec. Simple, fast (sub-microsecond decisions, typically 50 to 200 nanoseconds for an atomic counter), and no coordination needed. The catch: if 70 percent of traffic hits Server A, that server rejects requests while others have spare capacity. With 7,000 requests/sec to Server A (limit 1,250) and 3,000 spread across the other 7 servers, you reject 5,750 requests while 5,750 tokens go unused.
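A local bucket can be sketched in a few lines. This is a minimal illustration (class and parameter names are ours, not from any particular library): each server runs its own instance with its slice of the global limit.

```python
import time

class LocalTokenBucket:
    """Per-server token bucket: refill continuously, spend one token per request."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec       # refill rate (tokens/sec)
        self.capacity = capacity       # burst ceiling
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Each of the 8 servers gets 10,000 / 8 = 1,250 tokens/sec.
bucket = LocalTokenBucket(rate_per_sec=1_250, capacity=1_250)
```

Every decision is a couple of arithmetic operations on local memory, which is where the nanosecond-scale latency comes from; the flip side is that this instance knows nothing about the other seven servers.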

Distributed Buckets

All servers share the exact same bucket state, usually in Redis using INCR and EXPIRE. Every request checks the global count and decrements it atomically. You get precise enforcement: a limit of 10,000/sec allows exactly 10,000/sec regardless of traffic distribution. The cost: every request adds a 0.2 to 2 ms network round trip. At 50,000 requests/sec, that is 50,000 Redis operations/sec. A single Redis instance handles roughly 100,000 to 200,000 ops/sec, so high-volume systems risk making the rate limiter itself the bottleneck.
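The INCR + EXPIRE pattern above can be sketched as follows. To keep the example self-contained we use a tiny in-memory stand-in for the two Redis commands (an assumption for illustration; a real deployment would call a Redis client such as redis-py against an actual instance). Note this counter is strictly a fixed one-second window, a common approximation of the token bucket; a true shared bucket usually needs a Lua script so that refill and decrement happen in one atomic step.

```python
import time

class FakeRedis:
    """In-memory stand-in for the INCR and EXPIRE commands (illustration only)."""

    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def incr(self, key):
        now = time.time()
        value, expires = self.store.get(key, (0, float("inf")))
        if now >= expires:
            value, expires = 0, float("inf")  # expired: start a fresh counter
        value += 1
        self.store[key] = (value, expires)
        return value

    def expire(self, key, seconds):
        value, _ = self.store.get(key, (0, float("inf")))
        self.store[key] = (value, time.time() + seconds)

def allow(redis, user_id: str, limit: int) -> bool:
    """One shared counter per user per one-second window."""
    key = f"rl:{user_id}:{int(time.time())}"  # key rotates every second
    count = redis.incr(key)                   # atomic on real Redis
    if count == 1:
        redis.expire(key, 2)                  # let stale windows expire
    return count <= limit
```

Every call to `allow` is one network round trip in the real setup, which is exactly the 0.2 to 2 ms cost discussed above.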

Hybrid Token Leasing

Most production systems blend both. Servers grab leases of tokens from the global bucket: Server A requests 500 tokens from Redis, enforces them locally (zero network cost for 500 requests), then requests another batch. This cuts Redis ops from 50,000/sec to 100/sec (50,000 / 500 per lease). The tradeoff: you might overshoot the global limit by up to (num_servers times lease_size) during spikes. With 8 servers holding 500-token leases, you could exceed the limit by 4,000 requests before the next refresh. Tune lease size based on acceptable overshoot.
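The leasing flow can be sketched like this. The `GlobalBucket` class below stands in for the shared store (in production it would be a DECRBY-style atomic operation in Redis; the class and method names here are illustrative assumptions):

```python
import threading

class GlobalBucket:
    """Stand-in for the shared store; hands out token leases atomically."""

    def __init__(self, tokens: int):
        self.tokens = tokens
        self.lock = threading.Lock()

    def lease(self, n: int) -> int:
        # Grant up to n tokens in one atomic step (DECRBY-style in real Redis).
        with self.lock:
            granted = min(n, self.tokens)
            self.tokens -= granted
            return granted

class LeasingLimiter:
    """Server side: serve requests from the local lease, refresh in batches."""

    def __init__(self, global_bucket: GlobalBucket, lease_size: int = 500):
        self.global_bucket = global_bucket
        self.lease_size = lease_size
        self.local = 0

    def allow(self) -> bool:
        if self.local == 0:
            # One round trip per lease_size requests, not one per request.
            self.local = self.global_bucket.lease(self.lease_size)
        if self.local > 0:
            self.local -= 1
            return True
        return False
```

The overshoot in the text falls out of this design: each server may hold up to `lease_size` tokens it was granted before the global bucket emptied, so the worst case is `num_servers * lease_size` extra requests in flight.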

When to Use Which

Use local when speed matters and approximate enforcement is acceptable (internal services, non-billable limits). Use distributed when exact limits are critical (billing, paid API tiers, compliance). Use hybrid for user-facing APIs at scale: most systems accept 5 to 10 percent overshoot in exchange for a 10x reduction in coordination overhead.

💡 Key Takeaways
Local buckets: 50 to 200 ns decisions, no coordination, but uneven traffic wastes capacity (70 percent to one server rejects while others idle)
Distributed buckets: precise global enforcement via Redis, but adds 0.2 to 2 ms per request; bottleneck at 100K to 200K ops/sec
Hybrid token leasing: grab 500 token batches from Redis, enforce locally; reduces Redis ops from 50K/sec to 100/sec
Lease tradeoff: overshoot by (servers times lease_size) during spikes; 8 servers with 500 lease equals 4,000 potential overshoot
Decision: local for internal/non-billable limits, distributed for billing/compliance, hybrid for user-facing APIs at scale
📌 Interview Tips
1. Walk through hybrid leasing: 8 servers, 500-token lease each. A server grabs 500 from Redis, serves 500 requests locally (no network), then refreshes. Reduces 50K Redis ops/sec to 100/sec.
2. Explain local bucket waste: 70 percent of traffic to Server A with a 1,250 limit rejects 5,750 requests while other servers have 5,750 unused tokens.