Rate Limiting › Distributed Rate Limiting (Easy, ⏱️ ~2 min)

What is Distributed Rate Limiting and Why is it Needed?

Distributed rate limiting enforces usage quotas across many stateless application servers by centralizing the state about how many requests or units have been consumed in the current time window. Without this shared state, each server maintains its own counters, leading to inconsistent enforcement: if you have 10 servers each allowing 100 requests per minute locally, your actual system limit becomes 1,000 requests per minute instead of the intended 100.

The fundamental challenge is balancing accuracy against performance overhead. Every rate limit decision requires consulting a shared, low-latency datastore, adding a network round trip to the request path. In production systems, this typically adds single-digit milliseconds at the 99th percentile (p99) when the shared store is in the same region. Amazon API Gateway and Google Cloud Endpoints both use this pattern, enforcing token-bucket-based quotas from edge locations while maintaining centralized configuration.

The architecture decomposes into three parts: a configuration store holding limits per key (API key, user identifier (ID), tenant), a request store maintaining counters or tokens for each active key, and a decision engine implementing the algorithm. Keys often compose multiple dimensions, such as combining tenant ID with endpoint path, to prevent abuse while apportioning shared capacity fairly. Good designs optimize for the worst 1% of load scenarios and failure cases, not just average conditions.
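The decision engine for a token bucket can be sketched in a few lines. This is a minimal single-process illustration: the `buckets` dict stands in for the shared request store, and the function name and parameters are assumptions for this sketch. In a real deployment the read-modify-write would execute atomically inside the shared store (for example, as a server-side script) rather than in application code.

```python
import time

# Stand-in for the shared request store; in production this state lives in a
# low-latency store and the whole check runs as one atomic operation.
buckets = {}  # key -> (tokens, last_refill_timestamp)

def allow(key, capacity, refill_rate, now=None):
    """Token-bucket decision: refill tokens for elapsed time, then try to spend one."""
    now = time.monotonic() if now is None else now
    tokens, last = buckets.get(key, (capacity, now))
    # Refill proportionally to elapsed time, capped at the bucket's capacity.
    tokens = min(capacity, tokens + (now - last) * refill_rate)
    if tokens >= 1:
        buckets[key] = (tokens - 1, now)
        return True   # request admitted
    buckets[key] = (tokens, now)
    return False      # over quota: respond with HTTP 429
```

A caller would invoke `allow(api_key, capacity=100, refill_rate=100/60)` per request to approximate 100 requests per minute with burst capacity of 100.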
💡 Key Takeaways
Without shared state, 10 servers each allowing 100 requests per minute actually permit 1,000 requests per minute system-wide, making limits meaningless
Production systems see single-digit millisecond p99 latency overhead for in-region rate limit decisions, requiring one atomic read-modify-write operation per request
Amazon API Gateway and Google Cloud Endpoints enforce token bucket quotas at edge locations with centralized configuration, returning 429 status codes when limits are exceeded
Keys are often composite dimensions like tenant ID plus endpoint to enforce fair capacity allocation across multiple tenants and prevent abuse from specific users
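The "one atomic read-modify-write per request" pattern from the takeaways above is commonly implemented as a fixed-window counter: increment the counter for the current window, then compare against the limit. A minimal sketch, assuming a plain dict in place of the shared store (where the increment would be a single atomic operation such as a counter increment with expiry):

```python
import time

WINDOW = 60     # seconds per window
LIMIT = 100     # requests allowed per window per key

counters = {}   # stand-in for the shared store: (key, window_index) -> count

def check_fixed_window(key, now=None):
    """One read-modify-write per request: bump the current window's counter
    and admit the request only if it is within the limit."""
    now = time.time() if now is None else now
    window_key = (key, int(now // WINDOW))     # counter resets each window
    count = counters.get(window_key, 0) + 1    # atomic increment in a real store
    counters[window_key] = count
    return count <= LIMIT                      # False -> respond with HTTP 429
```

Fixed windows are cheap but allow up to 2x the limit across a window boundary, which is one reason production systems like those named above favor token buckets.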
📌 Examples
An API serving 10,000 requests per second across 50 servers needs distributed rate limiting to enforce a global 100 requests per second limit per API key. Each server would locally see only 200 requests per second on average, making local-only enforcement impossible.
Google Cloud Endpoints combines user ID, project ID, and API method to create composite rate limit keys, enabling limits like "1000 requests per minute per user per API method" while protecting shared backend services.
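Building a composite key like the one described above amounts to concatenating the dimensions into a single store key. The helper name and delimiter format below are assumptions for illustration, not Google Cloud Endpoints' actual internal format:

```python
def rate_limit_key(project_id: str, user_id: str, method: str) -> str:
    """Compose a per-user, per-method rate limit key (illustrative format)."""
    # One counter/bucket per (project, user, method) triple, so a limit like
    # "1000 requests per minute per user per API method" is enforced
    # independently for each combination.
    return f"rl:{project_id}:{user_id}:{method}"
```

Each distinct key maps to its own counter or token bucket in the request store, so one user hammering one method cannot exhaust another user's quota.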