
Failure Modes: Hotspot Keys, Clock Skew, and Race Conditions

Hotspot keys occur when a small set of identifiers (such as a popular tenant or viral content) creates write contention in the request store. Symptoms include rising p99 latency, timeouts, and, paradoxically, over-limit requests being allowed during retry races, where failed operations get retried and counted multiple times. At extreme scale, a single key receiving 100,000 requests per second can saturate a single storage partition. Mitigation strategies include per-key sharding using consistent hashing, local token caches with small time-to-live (TTL) values (each server caches 10 to 50 tokens locally for 100 milliseconds), and partitioned budgets where the global limit is split across servers.

Clock skew and time drift corrupt time-based algorithms. Sliding windows and token-bucket refill depend on accurate timestamps: if one server's clock runs 5 seconds fast, it refills tokens prematurely and allows excess traffic; if another runs slow, it unfairly denies requests. Mitigation requires clock synchronization via Network Time Protocol (NTP) or Precision Time Protocol (PTP), using the storage system's clock for refill calculations instead of application-server clocks, and adding small buffers to boundary calculations to tolerate minor drift.

Race conditions and double spend happen without atomic operations. Two concurrent requests each reading a count of 99 against a limit of 100 can both increment to 100 and both succeed, allowing 101 total requests. With weak consistency, or when reading from replicas to reduce latency, replication lag causes stale counter reads and temporary overruns. Criteo's evaluation across multiple data centers emphasized that accuracy degrades under the worst 1% of load when atomicity is weak. Solutions include atomic read-modify-write primitives (such as Redis INCR or compare-and-set operations), optimistic locking with retry, and treating duplicate requests as idempotent using request IDs.
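As a concrete illustration of the clock-skew mitigation, the sketch below refills a token bucket using the storage system's clock (the Redis TIME command via redis-py) instead of each application server's local clock. The key layout, rate, and capacity are assumptions for illustration, and the read-modify-write shown here would still need to be made atomic (for example, via a Lua script) to avoid the race conditions described above.

    # Minimal sketch (assumed parameters): token-bucket refill driven by the
    # store's clock, so skewed application-server clocks cannot refill early.
    import redis

    r = redis.Redis()
    RATE = 100.0    # tokens refilled per second (assumed)
    BURST = 100.0   # bucket capacity (assumed)

    def allow(key: str) -> bool:
        # Ask Redis for "now" so every server shares the same clock.
        seconds, micros = r.time()
        now = seconds + micros / 1_000_000

        state = r.hgetall(key)                        # {} on first use
        tokens = float(state.get(b"tokens", BURST))
        last = float(state.get(b"ts", now))

        # Refill based on time elapsed according to the store, capped at BURST.
        tokens = min(BURST, tokens + (now - last) * RATE)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0

        r.hset(key, mapping={"tokens": tokens, "ts": now})
        return allowed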
💡 Key Takeaways
Hotspot keys at 100,000 requests per second can saturate a single storage partition; a local token cache with a 100-millisecond TTL holding 10 to 50 tokens per server reduces store pressure by 90% while capping the worst-case overrun at the cache size
Clock skew of 5 seconds causes token buckets to refill prematurely or too late, unfairly admitting or denying traffic; using the storage system's timestamps for refill instead of application-server clocks eliminates this class of error
Race conditions without atomic operations allow double spend: two requests each reading a count of 99 both increment to 100, permitting 101 requests against a limit of 100
Replication lag when reading from followers causes temporary overruns; the two practical approaches are reading from the primary for hot keys or accepting a small error margin with eventual consistency
📌 Examples
A viral API key generates 200,000 requests per second globally. Without per-key sharding, this load concentrates on one storage partition. Splitting the key across 10 partitions (each handling 20,000 requests per second) and allocating 10% of the limit to each partition keeps every partition under capacity.
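A minimal sketch of that sharding scheme, assuming 10 shards and a fixed-window counter per shard; the key naming and window handling are illustrative, not prescribed by the example:

    # Spread one hot key's counter across SHARDS sub-keys so no single
    # storage partition absorbs all of its writes.
    import random
    import redis

    r = redis.Redis()
    SHARDS = 10  # assumed shard count

    def allow(api_key: str, global_limit: int, window_seconds: int = 1) -> bool:
        # Random shard choice spreads a single hot key's writes across partitions.
        shard = f"rl:{api_key}:{random.randrange(SHARDS)}"
        count = r.incr(shard)
        if count == 1:
            r.expire(shard, window_seconds)        # start the fixed window
        return count <= global_limit // SHARDS     # each shard enforces its slice

In a clustered store, the distinct shard suffixes hash to different slots, so the write load for the hot key lands on different partitions.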
The Redis INCR command provides an atomic increment that returns the new value. Pseudocode: new_count = INCR(rate_limit_key); if new_count > limit, DECR(rate_limit_key) and deny. This prevents double spend without requiring transactions or locking.
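A runnable version of that pseudocode, assuming redis-py and adding an expiry so counters do not live forever; the key name, limit, and window are illustrative:

    # INCR-then-DECR check: the increment is atomic, so two concurrent
    # requests can never both observe 99 and both pass a limit of 100.
    import redis

    r = redis.Redis()

    def try_acquire(rate_limit_key: str, limit: int, window_seconds: int = 60) -> bool:
        new_count = r.incr(rate_limit_key)            # atomic, returns the new value
        if new_count == 1:
            r.expire(rate_limit_key, window_seconds)  # bound the counter's lifetime
        if new_count > limit:
            r.decr(rate_limit_key)                    # hand the slot back and deny
            return False
        return True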