Distributed Leaky Bucket: Coordination and Consistency
The Multi-Server Problem
Leaky bucket works perfectly on a single server, but what happens when you have 10 servers, each with its own bucket? If each allows 100 requests/sec, one determined user could send 100 requests to each server, achieving 1,000/sec total while each server thinks it is enforcing limits correctly. This is the fundamental distributed rate limiting challenge.
Option 1: Divide the Budget
Give each server a fixed fraction of the global limit. With 10 servers and a 1,000/sec global limit, each gets 100/sec. Simple, with no coordination needed, but it wastes capacity when traffic is uneven. If nearly all requests hit Server A while the others sit idle, you reject everything above 100/sec while up to 900/sec of budget goes unused across the other servers. Works well when load balancing spreads traffic evenly; fails when it does not.
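A minimal sketch of the static split, in Python, with an in-memory bucket and the illustrative numbers from above (class and variable names are assumptions, not a standard API):

```python
GLOBAL_LIMIT = 1000.0  # global budget in requests/sec (from the example)
NUM_SERVERS = 10

class LocalBucket:
    """Leaky-bucket-as-meter that one server runs entirely locally."""
    def __init__(self, rate, capacity):
        self.rate = rate          # this server's share of the leak rate
        self.capacity = capacity  # local burst allowance (assumed = rate)
        self.level = 0.0
        self.last = 0.0           # timestamp of the last check, in seconds

    def allow(self, now):
        # Drain what has leaked since the last check, then try to add 1.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1.0 <= self.capacity:
            self.level += 1.0
            return True
        return False

# Static split: each server enforces 100/sec with no coordination.
per_server = GLOBAL_LIMIT / NUM_SERVERS
buckets = [LocalBucket(per_server, per_server) for _ in range(NUM_SERVERS)]
```

A burst of 150 simultaneous requests to one server admits only 100, even if the other nine buckets are empty, which is exactly the wasted capacity described above.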
Option 2: Central Coordination
All servers share one bucket state in Redis. Every request checks globally: read the current level, decide admit or reject, update the level. You get precise enforcement but add 0.5 to 2 ms of network latency per request. At 10,000 requests/sec, that is 10,000 Redis operations per second. A single Redis instance handles roughly 100K to 200K ops/sec, so high-volume systems risk making the rate limiter itself the bottleneck.
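A sketch of that shared check, using a plain dict as a stand-in for the Redis hash; in a real deployment the read-decide-update sequence must run atomically (for example as a Lua script executed via EVAL) so concurrent servers cannot interleave. Capacity and rate values are illustrative:

```python
# Stand-in for the shared Redis state (one hash for the global bucket).
state = {"level": 0.0, "last": 0.0}

CAPACITY = 1000.0   # burst allowance of the global bucket (assumed)
LEAK_RATE = 1000.0  # global drain rate in requests/sec (from the example)

def try_admit(now):
    """One global check: drain since the last update, then try to add 1."""
    elapsed = now - state["last"]
    state["level"] = max(0.0, state["level"] - elapsed * LEAK_RATE)
    state["last"] = now
    if state["level"] + 1.0 <= CAPACITY:
        state["level"] += 1.0
        return True   # admit: the new level is visible to every server
    return False      # reject: the shared bucket is full
```

Real callers would pass a clock reading (e.g. `time.monotonic()`) for `now`. Every call here corresponds to one Redis round trip, which is where both the per-request latency and the ops/sec ceiling come from.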
Option 3: Token Leasing
Servers lease batches of capacity from the global bucket. Server A requests 200 tokens from Redis, enforces locally until the batch is depleted, then requests another. This cuts Redis operations from 10,000/sec to 50/sec (10,000 / 200). The trade-off: during sudden traffic shifts, one server might hoard tokens while others starve. Lease expiration (unused tokens return to the pool after 1 to 5 seconds) mitigates this.
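The leasing pattern can be sketched like this; GlobalPool is an in-memory stand-in for the Redis-backed bucket, and the class names and batch size are illustrative assumptions:

```python
class GlobalPool:
    """Stand-in for the shared bucket; grants partial batches when low."""
    def __init__(self, tokens):
        self.tokens = tokens

    def lease(self, n):
        granted = min(n, self.tokens)
        self.tokens -= granted
        return granted

class LeaseClient:
    """One server's view: serve from a leased batch, refill when empty."""
    def __init__(self, pool, batch=200):
        self.pool = pool    # the Redis-backed global bucket in production
        self.batch = batch  # tokens leased per round trip
        self.local = 0

    def allow(self):
        if self.local == 0:
            # One "network" call replaces 200 per-request global checks.
            self.local = self.pool.lease(self.batch)
        if self.local > 0:
            self.local -= 1
            return True
        return False
```

With a batch of 200, a server handling 10,000 requests/sec makes only 50 lease calls/sec; the cost is that up to 200 tokens can sit stranded on a quiet server until its lease expires.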
Consistency vs Availability
What happens when Redis is unavailable? Strict enforcement means rejecting all requests (dangerous for availability). Lenient fallback means allowing requests without limits (dangerous for downstream services). The common compromise: fall back to local per-server limits during Redis outages. You lose global precision but keep protection. Log such incidents for later investigation.
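The fallback policy fits in a few lines. Here global_check and local_allow are hypothetical callables standing in for the Redis round trip and the per-server limiter, respectively:

```python
import logging

def allow_request(global_check, local_allow):
    """Consult the shared limiter; degrade to a local limit on outage.

    global_check() returns True/False from the Redis-backed bucket and
    raises ConnectionError when Redis is unreachable; local_allow() is
    this server's standalone per-server limit.
    """
    try:
        return global_check()
    except ConnectionError:
        # Keep serving with reduced precision, and record the incident.
        logging.warning("rate limiter: Redis unreachable, using local limit")
        return local_allow()
```

The key design choice is that the exception path still enforces *some* limit rather than failing open or closed outright.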