Failure Modes: Hot Keys, Clock Skew, and State Explosions
How Window-Based Rate Limiting Fails
Understanding failure modes helps you choose the right algorithm and set appropriate limits. Window-based approaches have specific vulnerabilities that emerge under production load, malicious traffic, or system misconfiguration. These are not theoretical: they cause real outages.
Key Explosion Attack
An attacker rotates through millions of unique identifiers (IPs, usernames, API keys). If you allocate a counter per key, memory grows linearly with the number of distinct keys: 1 million keys × 16 bytes = 16 MB just for rate limit state. A sliding log that stores timestamps is much worse: 1 million keys × 1,000 requests × 8 bytes = 8 GB. Mitigation: use approximate data structures with fixed memory (count-min sketch, Bloom filters) or hierarchical limits (a global limit catches key explosion even if per-key tracking fails).
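To make the approximate-structure mitigation concrete, here is a minimal count-min sketch in Python. It is a sketch under assumptions, not a production limiter: the class name, the `allow` helper, and the default `width` and `depth` are all hypothetical choices for illustration. The point it shows is that the number of counters is fixed at `width * depth` no matter how many keys an attacker sends.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counter with a fixed number of cells (width * depth)."""

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key: str):
        # One hash per row, made distinct by salting with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key: str) -> int:
        """Increment the key's cells and return the new estimated count."""
        estimate = None
        for row, col in self._cells(key):
            self.rows[row][col] += 1
            cell = self.rows[row][col]
            estimate = cell if estimate is None else min(estimate, cell)
        return estimate

    def estimate(self, key: str) -> int:
        """Return the estimated count; it can overcount but never undercount."""
        return min(self.rows[row][col] for row, col in self._cells(key))


def allow(sketch: CountMinSketch, key: str, limit: int) -> bool:
    # Hypothetical helper: counts every attempt, including rejected ones.
    return sketch.add(key) <= limit
```

Because collisions only ever inflate an estimate, this design can reject a legitimate key slightly early but does not let a key exceed its limit unnoticed. Resetting per window is left out here; one option is to allocate a fresh sketch at each window boundary.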
Clock Skew Issues
Fixed window boundaries depend on server time. If Server A thinks it is 2:00:00 and Server B thinks it is 1:59:58, they disagree about which window a request belongs to. A user whose requests hit both servers can get up to double the intended rate. NTP typically keeps servers within 10 ms, but edge cases (network partitions, VM migration, leap seconds) can cause larger drift. Mitigation: use a central time source (for example, the Redis server's time) or accept some imprecision.
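The disagreement is easy to reproduce with the arithmetic that maps a timestamp to a window. The following Python sketch assumes a hypothetical `window_key` helper and a 60-second window, and uses seconds since midnight as illustrative clock readings (7200 is 2:00:00, 7198 is 1:59:58).

```python
def window_key(client_id: str, now: float, window_seconds: int = 60) -> str:
    """Build the counter key for the fixed window that contains `now`."""
    window_index = int(now // window_seconds)
    return f"rl:{client_id}:{window_index}"


# Each server reads its own clock: the same request lands in different windows.
server_a_clock = 7200.0  # Server A believes it is 2:00:00
server_b_clock = 7198.0  # Server B believes it is 1:59:58
key_on_a = window_key("user-42", server_a_clock)
key_on_b = window_key("user-42", server_b_clock)

# Both servers take the timestamp from one shared source: the keys agree.
shared_clock = 7200.0
shared_key_a = window_key("user-42", shared_clock)
shared_key_b = window_key("user-42", shared_clock)
```

In a Redis-backed limiter, the shared reading could come from the Redis `TIME` command, which returns the server's time as seconds and microseconds. That costs an extra round trip unless the window calculation runs inside a server-side script, so accepting a small imprecision is often the cheaper choice.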
Hot Key Problem
One viral API key or popular user generates 1,000x normal traffic. All requests for that key hit the same rate limit counter, creating a hot spot in Redis. At 50,000 requests/sec to one key, you are doing 50,000 INCR operations per second on a single Redis key, all handled by the one command thread of the instance that owns it, which can exceed that thread's capacity once other traffic is added. Mitigation: split the hot key's counter into sub-keys sharded across multiple Redis instances, use local caching with periodic sync, or implement adaptive throttling that rejects hot keys earlier in the pipeline.
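The local-caching mitigation can be sketched as a counter that accumulates increments in process memory and writes them to the shared store in batches. This is an illustrative Python sketch: `BatchedCounter`, `flush_every`, and the plain dict standing in for Redis are all assumptions. Against a real Redis, the flush would be a single `INCRBY` instead of many `INCR` calls.

```python
from collections import defaultdict


class BatchedCounter:
    """Counts locally and pushes to the shared store once per `flush_every` hits."""

    def __init__(self, store: dict, flush_every: int = 100) -> None:
        self.store = store              # stand-in for the shared Redis counters
        self.flush_every = flush_every
        self.pending = defaultdict(int) # increments not yet pushed to the store
        self.remote_writes = 0          # how many writes the store actually saw

    def incr(self, key: str) -> int:
        """Record one request and return this process's view of the count."""
        self.pending[key] += 1
        if self.pending[key] >= self.flush_every:
            self.flush(key)
        return self.store.get(key, 0) + self.pending.get(key, 0)

    def flush(self, key: str) -> None:
        # One batched write replaces `flush_every` individual increments.
        amount = self.pending.pop(key, 0)
        if amount:
            self.store[key] = self.store.get(key, 0) + amount
            self.remote_writes += 1


shared_store = {}
counter = BatchedCounter(shared_store, flush_every=100)
for _ in range(1000):
    counter.incr("hot-api-key")
```

With `flush_every=100`, the 1,000 requests above produce 10 writes to the shared store. The trade-off is accuracy: each application server can be up to `flush_every` requests behind, so the limit can be overshot by roughly that amount per server. A time-based flush would also be needed so low-traffic keys do not stay unsynced.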
State Explosion from Long Windows
A sliding log with a 24-hour window and a limit of 1,000 requests/day must store up to 1,000 timestamps per user for 24 hours. With 100,000 users at 8 bytes per timestamp, that is potentially 800 MB of rate limit state. Mitigation: use a hybrid approach, with a fine-grained sliding window for the short term (1 minute) and a fixed window for the long term (daily), combining fairness with manageable memory.
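One way the hybrid approach could look is shown below: a sliding log that only ever holds the last 60 seconds of timestamps, plus a single fixed-window counter per user for the day. This is a minimal in-memory Python sketch; the `HybridLimiter` name, its parameters, and the UTC-day window boundary are assumptions for illustration.

```python
from collections import defaultdict, deque


class HybridLimiter:
    """Sliding log for the last minute, fixed-window counter for the day."""

    def __init__(self, per_minute: int, per_day: int) -> None:
        self.per_minute = per_minute
        self.per_day = per_day
        self.recent = defaultdict(deque)  # user -> timestamps in the last 60 s
        self.daily = {}                   # user -> (day_index, count)

    def allow(self, user: str, now: float) -> bool:
        # Drop timestamps that have aged out of the 60-second sliding window.
        log = self.recent[user]
        while log and log[0] <= now - 60:
            log.popleft()

        # The daily counter resets when the fixed 24-hour window rolls over.
        day = int(now // 86400)
        stored_day, count = self.daily.get(user, (day, 0))
        if stored_day != day:
            count = 0

        if len(log) >= self.per_minute or count >= self.per_day:
            return False

        log.append(now)
        self.daily[user] = (day, count + 1)
        return True
```

Per user, the state is at most `per_minute` timestamps plus one counter, instead of up to `per_day` timestamps. The daily fixed window brings back the boundary-burst behavior of fixed windows, but only at the day boundary, where the short-term sliding window still caps the burst rate.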