Failure Modes: Hotspot Keys, Clock Skew, and Race Conditions
How Distributed Rate Limiting Breaks
Centralized coordination introduces failure modes that do not exist in local rate limiting. Understanding these helps you design resilient systems. These are not edge cases: they cause real production incidents at scale.
Hot Key Problem
One popular user or one viral piece of content creates a hotspot. If user X generates 50,000 requests/sec, every one of those requests is a Redis operation on a single key (user:X:counter). A single Redis thread handles roughly 100K ops/sec, so one hot key consumes half your capacity. Mitigation: shard counters by key hash across multiple Redis instances (splitting a single hot key into several sub-counters so its load is spread too), use local caching with periodic sync, or implement request coalescing for hot keys.
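One way to picture splitting a hot key is the sketch below. It is an in-memory illustration, not a Redis client: the `Counter` stands in for the Redis instances, and `NUM_SHARDS`, `sharded_key`, and `allow` are hypothetical names chosen for this example. Each request picks a random sub-key and is checked against an equal share of the limit.

```python
import random
from collections import Counter

NUM_SHARDS = 8  # hypothetical value; tune to the hot key's traffic


def sharded_key(base_key: str, num_shards: int = NUM_SHARDS) -> str:
    """Pick one of num_shards sub-keys at random for this request."""
    return f"{base_key}:{random.randrange(num_shards)}"


def allow(counters: Counter, base_key: str, limit: int,
          num_shards: int = NUM_SHARDS) -> bool:
    """Admit the request if its randomly chosen shard is under its share of the limit."""
    per_shard_limit = limit // num_shards
    key = sharded_key(base_key, num_shards)
    if counters[key] >= per_shard_limit:
        return False
    counters[key] += 1
    return True


# Usage: an in-memory Counter stands in for the sharded Redis counters.
counters = Counter()
admitted = sum(allow(counters, "user:X:counter", limit=800) for _ in range(10_000))
```

The trade-off is accuracy: a request can be rejected because its shard is full while other shards still have room, so the effective limit is slightly below the nominal one until every shard fills.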
Clock Skew
Fixed window boundaries depend on time. Server A at 14:00:00 and Server B at 13:59:58 disagree about which window a request belongs to. A user whose requests hit both servers is counted in two different windows and can get up to double the rate. NTP usually keeps servers within about 10 ms of each other, but edge cases (VM migration, network partition, leap seconds) can cause larger drift. Mitigation: use Redis server time as the single source of truth, or accept that window boundaries have around 100 ms of fuzziness.
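A minimal sketch of deriving the window from one clock: every application server builds the counter key from the time reported by Redis rather than from its own clock. The helper name `window_key` and the key layout are assumptions for illustration. The input is a `(seconds, microseconds)` pair, which is the shape the Redis `TIME` command reports and what redis-py's `client.time()` returns.

```python
def window_key(user_id: str, redis_time: tuple, window_seconds: int = 60) -> str:
    """Build a fixed-window counter key from the Redis server's clock.

    redis_time is a (seconds, microseconds) pair as reported by Redis TIME.
    """
    seconds, _micros = redis_time
    # Round down to the start of the current window.
    window_start = seconds - (seconds % window_seconds)
    return f"user:{user_id}:counter:{window_start}"


# Usage: two servers that read the same Redis time agree on the window,
# regardless of how far their local clocks have drifted.
key = window_key("X", (1700000045, 123))
```

This costs one extra round trip per request unless the time lookup is folded into the same server-side script that updates the counter.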
Race Conditions
Two requests arrive simultaneously against a limit of 100. Both read the counter at 99, both conclude they are under the limit, both increment, and both proceed: the actual count is now 101. Under high concurrency, this "double spending" can let through 2x to 10x your intended rate. Mitigation: use atomic operations. In Redis, Lua scripts execute atomically: read, check, increment, and return happen in one indivisible operation. Never do read, then decide, then write as separate commands.
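The atomic read, check, increment sequence could look roughly like the following. This is a sketch under assumptions, not a definitive implementation: `client` is assumed to be a redis-py client (only its `register_script` method is used), and `FIXED_WINDOW_LUA` and `make_limiter` are hypothetical names. The script runs entirely inside Redis, so no other command can interleave between the check and the increment.

```python
# Lua source executed atomically by Redis.
# KEYS[1] = counter key, ARGV[1] = limit, ARGV[2] = window length in seconds.
FIXED_WINDOW_LUA = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
if current >= tonumber(ARGV[1]) then
  return 0
end
current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])
end
return 1
"""


def make_limiter(client):
    """Return allow(key, limit, window_seconds) backed by the Lua script.

    client is assumed to be a redis-py client exposing register_script().
    """
    script = client.register_script(FIXED_WINDOW_LUA)

    def allow(key: str, limit: int, window_seconds: int) -> bool:
        # One round trip; the check and the increment cannot be separated.
        return script(keys=[key], args=[limit, window_seconds]) == 1

    return allow
```

Usage would be along the lines of `allow = make_limiter(client)` followed by `allow("user:X:counter", 100, 60)` on each request.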
Redis Failures
Redis goes down: what happens? You can block all requests (strict enforcement, poor availability) or allow all requests (good availability, no rate limiting). A common compromise is to fall back to local per-server limits. You lose global accuracy but maintain protection. With 10 servers and a 1,000/sec global limit, a local fallback of 100/sec per server gives 1,000/sec total if traffic is even, but as little as 100/sec if traffic is skewed onto one server. Log these incidents for the postmortem.
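The fallback described above might be wired up as in this sketch. `LocalFixedWindow` and `allow_with_fallback` are hypothetical names, and `global_check` stands for whatever callable performs the Redis-backed check. The local limiter is single-threaded as written; a real server would guard it with a lock.

```python
import logging
import time

logger = logging.getLogger("ratelimit")


class LocalFixedWindow:
    """Per-process fixed-window counter used only while the global limiter is down."""

    def __init__(self, limit: int, window_seconds: float = 1.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def allow(self) -> bool:
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            # A new window has started: reset the counter.
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


def allow_with_fallback(global_check, local_limiter: LocalFixedWindow) -> bool:
    """Use the global limiter; if it raises, log and apply the local limit."""
    try:
        return global_check()
    except Exception:  # in real code, narrow this to connection/timeout errors
        logger.warning("global rate limiter unavailable; using local fallback")
        return local_limiter.allow()


# Usage: 1,000/sec global limit split across 10 servers.
local = LocalFixedWindow(limit=1000 // 10)
```

Catching a broad exception keeps the sketch short; the logged warning is what gives you a record of how long the system ran on local limits.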