Implementation Patterns: Sharding, Caching, and Operational Strategies

Effective implementations minimize remote operations to one atomic read-modify-write per request. For sliding windows, store aggregated counters instead of per-event timestamps; at high QPS, per-event storage explodes memory and makes prune operations expensive. A bucketed-counter sketch follows this section.

Shard keys with a stable hash function to spread hot tenants across storage partitions, and align the partition topology with your storage system's internal sharding. For example, Redis Cluster maps every key to one of 16,384 hash slots, so choose a key scheme that distributes evenly across slots to prevent unbalanced load (see the slot-distribution check below).

Local caching of token leases dramatically reduces store pressure for very hot keys. A server can prefetch 50 tokens from the shared store, serve 50 requests locally without remote calls, then fetch another batch. This cuts store operations by a factor of 50 but introduces bounded inaccuracy: if 10 servers each cache 50 tokens and then crash, 500 tokens of quota are lost. Cap the lease size based on acceptable overrun risk. Amazon and Google APIs likely use similar lease-based optimizations internally, though exact implementation details are not publicly documented.

Operational strategies include dark-launch modes, where the limiter runs in observe-only mode and logs decisions without enforcing them (sketched below). Compare its decisions against ground truth to calibrate error and latency impact before enabling enforcement. Define service level objectives (SLOs), such as p99 decision latency under 3 milliseconds in-region and a decision error rate under 0.1%. Monitor decision latency at all percentiles, store saturation (CPU, memory, operations per second), the ratio of denied to allowed requests, near-limit utilization patterns, key-skew distribution, and error rates. Roll out configuration changes gradually, per tenant or endpoint, watching for traffic-redistribution effects where new limits on one endpoint shift load onto others.
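The bucketed sliding-window counter mentioned above can be sketched as follows. This is a minimal illustration, assuming a Redis-backed store via the redis-py client; the key scheme, window length, and limit are hypothetical placeholders.

```python
import time

import redis  # assumes the redis-py client; any store with atomic INCR works

r = redis.Redis()

WINDOW_SECONDS = 60   # sliding-window length
BUCKET_SECONDS = 6    # aggregation granularity: 10 counters per window
LIMIT = 1000          # allowed requests per key per window

def allow(key: str) -> bool:
    """Sliding-window check over aggregated per-bucket counters.

    Stores one counter per 6-second bucket instead of one timestamp per
    event, so memory and prune cost stay constant regardless of QPS.
    """
    bucket = int(time.time()) // BUCKET_SECONDS
    n_buckets = WINDOW_SECONDS // BUCKET_SECONDS

    # one round trip: count this request, refresh TTL, read the trailing window
    pipe = r.pipeline()  # transactional (MULTI/EXEC) by default
    pipe.incr(f"{key}:{bucket}")
    pipe.expire(f"{key}:{bucket}", WINDOW_SECONDS * 2)  # old buckets age out
    for b in range(bucket - n_buckets + 1, bucket + 1):
        pipe.get(f"{key}:{b}")
    results = pipe.execute()

    total = sum(int(c) for c in results[2:] if c is not None)
    return total <= LIMIT
```

Note that denied requests still increment the current bucket here; whether rejected traffic should count against the window is a policy choice.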
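To check that a key scheme spreads evenly across Redis Cluster's hash slots, the documented slot algorithm (CRC16/XMODEM of the key, modulo 16,384, honoring `{...}` hash tags) can be reproduced offline. The `rl:tenant:<id>` key format below is a hypothetical example.

```python
from collections import Counter

def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM (polynomial 0x1021), the checksum Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def cluster_slot(key: str) -> int:
    """Slot for a key: CRC16 mod 16384, hashing only the first {...} tag if present."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# Sanity check: how evenly do 100k tenant keys land on slots?
dist = Counter(cluster_slot(f"rl:tenant:{i}") for i in range(100_000))
print(f"slots hit: {len(dist)}, max keys on one slot: {max(dist.values())}")
```

A heavily skewed distribution here predicts partition hotspots before any traffic is served.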
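A dark launch can be as simple as a wrapper that always computes and logs the decision but only enforces it behind a flag. A minimal sketch, assuming the `allow` function from the first sketch and a hypothetical per-deployment `ENFORCE` flag:

```python
import logging

log = logging.getLogger("ratelimit")

ENFORCE = False  # observe-only until flipped per tenant/endpoint during rollout

def check(key: str) -> bool:
    """Compute and log the limiter decision; enforce only when ENFORCE is set.

    Running with ENFORCE=False surfaces decision latency and deny rates
    under real traffic before any request is actually rejected.
    """
    allowed = allow(key)
    if not allowed:
        log.info("rate-limit deny key=%s enforced=%s", key, ENFORCE)
    return allowed if ENFORCE else True
```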
💡 Key Takeaways
Local token caching reduces store operations by a factor of 50: fetch 50 tokens once and serve 50 requests locally instead of making 50 individual remote calls, at the cost of a bounded overrun if servers crash while holding cached tokens
Shard keys using stable hash functions aligned with storage topology; for Redis Cluster with 16,384 hash slots, ensure even distribution to prevent partition hotspots that degrade p99 latency
Dark launch with observe-only mode lets you measure decision latency and accuracy against ground truth before enforcement; Criteo reported this as critical for validating behavior under real production load patterns
Service level objectives should include p99 decision latency under 3 milliseconds for in-region calls and a decision error rate under 0.1%, monitored continuously with alerts on degradation
📌 Examples
Token lease implementation: the app server calls DECRBY rate_limit_key 50 and the store returns the new count. If the decrement succeeds, the server sets its local counter to 50 and serves requests by decrementing it locally; when the local counter reaches 0, it fetches the next batch. If the server crashes with 30 tokens remaining, those 30 are lost (an acceptable, bounded error). A sketch follows.
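A minimal sketch of that lease protocol, assuming the redis-py client and a rate_limit_key counter pre-seeded with the window's quota (and reset by some external process each window):

```python
import redis  # assumes the redis-py client

r = redis.Redis()
LEASE_SIZE = 50  # cap based on acceptable overrun if a server crashes

class LeasedLimiter:
    """Serves LEASE_SIZE requests locally per remote call."""

    def __init__(self, key: str = "rate_limit_key"):
        self.key = key
        self.local = 0  # tokens left in the current lease

    def allow(self) -> bool:
        if self.local == 0:
            # one atomic remote operation buys the next batch
            remaining = r.decrby(self.key, LEASE_SIZE)
            if remaining < 0:
                # quota exhausted: return the batch and deny. (A production
                # version would keep any partial remainder instead.)
                r.incrby(self.key, LEASE_SIZE)
                return False
            self.local = LEASE_SIZE
        self.local -= 1  # tokens held here are lost if this server crashes
        return True
```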
Monitoring dashboard tracks: decision latency histogram (p50 = 1.2 ms, p99 = 2.8 ms, p99.9 = 15 ms), store CPU at 45%, denied-request rate at 2.3%, and the top 10 keys by volume, with one key consuming 15% of total quota (a potential hotspot requiring investigation).