Implementation Patterns: Sharding, Caching, and Operational Strategies

Effective implementations minimize remote operations to one atomic read-modify-write per request. For sliding windows, store aggregated counters instead of per-event timestamps; at high QPS, per-event storage explodes memory and makes prune operations expensive. A bucketed-counter sketch follows this section.

Shard keys with a stable hash function to spread hot tenants across storage partitions, and align the partition topology with your storage system's internal sharding. For example, Redis Cluster maps every key to one of 16,384 hash slots, so choose a key scheme that distributes evenly across slots to prevent unbalanced load (see the slot-distribution check below).

Local caching of token leases dramatically reduces store pressure for very hot keys. A server can prefetch 50 tokens from the shared store, serve 50 requests locally without remote calls, then fetch another batch. This cuts store operations by a factor of 50 but introduces bounded inaccuracy: if 10 servers each cache 50 tokens and then crash, 500 tokens of quota are lost. Cap the lease size based on acceptable overrun risk. Amazon and Google APIs likely use similar lease-based optimizations internally, though exact implementation details are not publicly documented.

Operational strategies include dark-launch modes, where the limiter runs in observe-only mode and logs decisions without enforcing them (sketched below). Compare its decisions against ground truth to calibrate error and latency impact before enabling enforcement. Define service level objectives (SLOs), such as p99 decision latency under 3 milliseconds in-region and a decision error rate under 0.1%. Monitor decision latency at all percentiles, store saturation (CPU, memory, operations per second), the ratio of denied to allowed requests, near-limit utilization patterns, key-skew distribution, and error rates. Roll out configuration changes gradually, per tenant or endpoint, watching for traffic-redistribution effects where new limits on one endpoint shift load onto others.
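The bucketed sliding-window counter mentioned above can be sketched as follows. This is a minimal illustration, assuming a Redis-backed store via the redis-py client; the key scheme, window length, and limit are hypothetical placeholders.

```python
import time

import redis  # assumes the redis-py client; any store with atomic INCR works

r = redis.Redis()

WINDOW_SECONDS = 60   # sliding-window length
BUCKET_SECONDS = 6    # aggregation granularity: 10 counters per window
LIMIT = 1000          # allowed requests per key per window

def allow(key: str) -> bool:
    """Sliding-window check over aggregated per-bucket counters.

    Stores one counter per 6-second bucket instead of one timestamp per
    event, so memory and prune cost stay constant regardless of QPS.
    """
    bucket = int(time.time()) // BUCKET_SECONDS
    n_buckets = WINDOW_SECONDS // BUCKET_SECONDS

    # one round trip: count this request, refresh TTL, read the trailing window
    pipe = r.pipeline()  # transactional (MULTI/EXEC) by default
    pipe.incr(f"{key}:{bucket}")
    pipe.expire(f"{key}:{bucket}", WINDOW_SECONDS * 2)  # old buckets age out
    for b in range(bucket - n_buckets + 1, bucket + 1):
        pipe.get(f"{key}:{b}")
    results = pipe.execute()

    total = sum(int(c) for c in results[2:] if c is not None)
    return total <= LIMIT
```

Note that denied requests still increment the current bucket here; whether rejected traffic should count against the window is a policy choice.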
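To check that a key scheme spreads evenly across Redis Cluster's hash slots, the documented slot algorithm (CRC16/XMODEM of the key, modulo 16,384, honoring `{...}` hash tags) can be reproduced offline. The `rl:tenant:<id>` key format below is a hypothetical example.

```python
from collections import Counter

def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM (polynomial 0x1021), the checksum Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def cluster_slot(key: str) -> int:
    """Slot for a key: CRC16 mod 16384, hashing only the first {...} tag if present."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# Sanity check: how evenly do 100k tenant keys land on slots?
dist = Counter(cluster_slot(f"rl:tenant:{i}") for i in range(100_000))
print(f"slots hit: {len(dist)}, max keys on one slot: {max(dist.values())}")
```

A heavily skewed distribution here predicts partition hotspots before any traffic is served.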
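A dark launch can be as simple as a wrapper that always computes and logs the decision but only enforces it behind a flag. A minimal sketch, assuming the `allow` function from the first sketch and a hypothetical per-deployment `ENFORCE` flag:

```python
import logging

log = logging.getLogger("ratelimit")

ENFORCE = False  # observe-only until flipped per tenant/endpoint during rollout

def check(key: str) -> bool:
    """Compute and log the limiter decision; enforce only when ENFORCE is set.

    Running with ENFORCE=False surfaces decision latency and deny rates
    under real traffic before any request is actually rejected.
    """
    allowed = allow(key)
    if not allowed:
        log.info("rate-limit deny key=%s enforced=%s", key, ENFORCE)
    return allowed if ENFORCE else True
```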
💡 Key Takeaways
Local token caching reduces store operations by a factor of 50: fetch 50 tokens once and serve 50 requests locally instead of making 50 individual remote calls, at the cost of a bounded overrun if servers crash while holding cached tokens
Shard keys using stable hash functions aligned with storage topology; for Redis Cluster with 16,384 hash slots, ensure even distribution to prevent partition hotspots that degrade p99 latency
Dark launch with observe-only mode lets you measure decision latency and accuracy against ground truth before enforcement; Criteo reported this as critical for validating behavior under real production load patterns
Service level objectives should include p99 decision latency under 3 milliseconds for in-region calls and a decision error rate under 0.1%, monitored continuously with alerts on degradation
📌 Examples
Token lease implementation: the app server calls DECRBY rate_limit_key 50 and the store returns the new count. If the decrement succeeds, the server sets its local counter to 50 and serves requests by decrementing it locally; when the local counter reaches 0, it fetches the next batch. If the server crashes with 30 tokens remaining, those 30 are lost (an acceptable, bounded error). A sketch follows.
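A minimal sketch of that lease protocol, assuming the redis-py client and a rate_limit_key counter pre-seeded with the window's quota (and reset by some external process each window):

```python
import redis  # assumes the redis-py client

r = redis.Redis()
LEASE_SIZE = 50  # cap based on acceptable overrun if a server crashes

class LeasedLimiter:
    """Serves LEASE_SIZE requests locally per remote call."""

    def __init__(self, key: str = "rate_limit_key"):
        self.key = key
        self.local = 0  # tokens left in the current lease

    def allow(self) -> bool:
        if self.local == 0:
            # one atomic remote operation buys the next batch
            remaining = r.decrby(self.key, LEASE_SIZE)
            if remaining < 0:
                # quota exhausted: return the batch and deny. (A production
                # version would keep any partial remainder instead.)
                r.incrby(self.key, LEASE_SIZE)
                return False
            self.local = LEASE_SIZE
        self.local -= 1  # tokens held here are lost if this server crashes
        return True
```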
Monitoring dashboard tracks: decision latency histogram (p50 = 1.2 ms, p99 = 2.8 ms, p99.9 = 15 ms), store CPU at 45%, denied-request rate at 2.3%, and the top 10 keys by volume, with one key consuming 15% of total quota (a potential hotspot requiring investigation).