Implementation Patterns: Sharding, Caching, and Operational Strategies
Making It Work in Production
Theory is nice; here is how to actually implement distributed rate limiting that survives real-world traffic. These patterns come from production systems handling millions of requests per second.
Minimize Round Trips
Every network call adds latency and failure risk, so design for one atomic operation per request. For a token bucket in Redis, use a Lua script that reads the token count, computes the refill, subtracts the request cost, and returns the decision in one call. For a sliding window, pipeline the two reads (current and previous window) into a single round trip. Target: 0.5 to 1 ms of rate-limiting overhead per request.
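To make the atomicity requirement concrete, here is the read-refill-subtract decision as a pure-Python sketch with an injectable clock. In production this exact read-modify-write sequence would live inside a single Redis Lua script so no other client can interleave between the read and the write; the state layout here is a hypothetical stand-in for the Redis hash.

```python
import time

def token_bucket_allow(state, capacity, refill_rate, cost=1, now=None):
    """One atomic decision: refill lazily, then try to subtract `cost`.
    `state` is a dict {"tokens": float, "ts": float}; in Redis these two
    fields and this whole function body would be a single Lua script."""
    now = time.monotonic() if now is None else now
    elapsed = max(0.0, now - state["ts"])
    # Refill based on elapsed time, capped at bucket capacity.
    state["tokens"] = min(capacity, state["tokens"] + elapsed * refill_rate)
    state["ts"] = now
    if state["tokens"] >= cost:
        state["tokens"] -= cost
        return True
    return False
```

Lazy refill (computing tokens from the elapsed time on each call) avoids a background timer per key, which is what makes the one-call design possible.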
Sharding Strategy
A single Redis instance handles 100K to 200K ops/sec. Beyond that, shard by a hash of the user ID across multiple instances: with 4 Redis instances, hash(user_id) % 4 determines which instance handles each user. This also isolates hot keys: one viral user affects only one shard. For global limits (total API rate), designate one instance as the global counter, or use distributed counters with eventual consistency.
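A minimal sketch of the shard-selection step, assuming four shards with hypothetical addresses. The key point is to use a hash that is stable across processes and deploys; Python's built-in hash() is randomized per process and would send the same user to different shards from different app servers.

```python
import zlib

# Hypothetical shard addresses; 4 instances as in the text above.
REDIS_SHARDS = ["redis-0:6379", "redis-1:6379", "redis-2:6379", "redis-3:6379"]

def shard_for(user_id: str) -> str:
    """Pick a shard with a stable hash so every app server agrees.
    CRC32 is illustrative; any deterministic hash works."""
    idx = zlib.crc32(user_id.encode("utf-8")) % len(REDIS_SHARDS)
    return REDIS_SHARDS[idx]
```

Note that plain modulo sharding remaps most keys when the shard count changes; if you expect to resize the fleet, consistent hashing limits that churn.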
Local Caching Layer
Before checking Redis, check a local in-memory cache. If a user is already over the limit (a cached denial), skip the Redis call entirely. Cache grants for 100 to 500 ms; this reduces Redis load by 50 to 80% for hot keys. The trade-off: cached grants can overshoot by the cache duration times the arrival rate. With a 100 ms cache and 1,000 requests/sec arrival, you might grant 100 extra requests per server.
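A minimal sketch of that local layer, assuming a short fixed TTL and an injectable clock for testability. A `None` return means the caller must go to Redis; a cached `False` short-circuits the denial without any network call.

```python
import time

class LocalLimitCache:
    """Short-TTL cache in front of Redis. Cached denials skip the Redis
    call entirely; cached grants trade accuracy for load (they can
    overshoot by roughly ttl * arrival_rate per server)."""

    def __init__(self, ttl=0.1):
        self.ttl = ttl
        self._entries = {}  # user_id -> (decision, expiry)

    def get(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(user_id)
        if entry is not None and entry[1] > now:
            return entry[0]   # fresh hit: reuse the recent decision
        return None           # miss or expired: caller asks Redis

    def put(self, user_id, decision, now=None):
        now = time.monotonic() if now is None else now
        self._entries[user_id] = (decision, now + self.ttl)
```

In practice you would also bound the entry count (an LRU) so a key-spraying attack cannot exhaust memory.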
Operational Strategies
Monitor: track denial rate, Redis latency, and cache hit rate. Alert on a sustained denial rate above 5% (it indicates a capacity issue or an attack). Graceful degradation: if Redis latency exceeds 10 ms, fall back to local limits. Rate limit headers: always return X-RateLimit-Remaining and Retry-After so clients can back off gracefully. These practices keep rate limiting from becoming the cause of outages rather than the solution.
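The degradation and header practices can be sketched together. This is illustrative, not a definitive implementation: `redis_check` and `local_check` are hypothetical callables returning `(allowed, remaining)`, and the fixed `Retry-After` of 1 second is a placeholder where a real limiter would derive the value from its refill math.

```python
def check_with_fallback(user_id, redis_check, local_check,
                        redis_latency_ms, latency_budget_ms=10.0):
    """If Redis is slow or unreachable, fall back to a per-server local
    limit instead of blocking the request path or failing open."""
    if redis_latency_ms > latency_budget_ms:
        allowed, remaining = local_check(user_id)
    else:
        try:
            allowed, remaining = redis_check(user_id)
        except ConnectionError:
            allowed, remaining = local_check(user_id)
    headers = {"X-RateLimit-Remaining": str(remaining)}
    if not allowed:
        headers["Retry-After"] = "1"  # placeholder; derive from refill rate
    return allowed, headers
```

Falling back to local limits is deliberately conservative: each server enforces a fraction of the global limit on its own, so a Redis outage degrades accuracy rather than availability.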