Fairness, Rate Limiting, and Load Shedding to Protect Against Hotspot Damage
Even with caching, rebalancing, and sharding, some hotspots cannot be fully eliminated: a single user generating abusive traffic, a bot scraping millions of pages, or a viral event concentrating load on one entity. The last line of defense is fairness mechanisms, per-key rate limiting, and selective load shedding, which prevent a few hot keys from monopolizing shared resources and degrading the experience for everyone else. These techniques trade peak utilization (leaving some capacity idle when one tenant is throttled) for predictable tail latency and multi-tenant fairness. The goal is to fail gracefully under extreme skew rather than cascade into a total outage.
Per-key token bucket rate limits enforce a maximum Requests Per Second (RPS) cap for any individual key, such as 500 RPS per user ID or 1,000 RPS per object ID. This bounds worst-case load even if a single key goes viral or is attacked. Token buckets allow short bursts by accumulating credits during idle periods, which keeps the experience good for legitimately spiky traffic, but they drain quickly under sustained abuse. Implementation must be efficient: in-memory token buckets per key with Least Recently Used (LRU) eviction for cold keys, or distributed rate limiters using Redis or similar with sub-millisecond latency. The trade-off is that enforcing strict per-key limits may leave aggregate capacity idle; if one user is throttled at 500 RPS but the backend can sustain 10,000 RPS total, you are leaving 9,500 RPS unused. Adaptive approaches allow burst borrowing, where keys can exceed their base quota while aggregate load is below capacity, with hard caps still in place to prevent monopolization.
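A minimal sketch of such a limiter in Python, assuming a single-process deployment; the class name, default parameters, and eviction bound are illustrative rather than any specific library's API, and a distributed setup would keep the same bucket math in Redis or a similar shared store:

```python
import time
from collections import OrderedDict


class PerKeyTokenBucketLimiter:
    """Illustrative in-memory per-key token bucket with LRU eviction of cold keys."""

    def __init__(self, rate=500.0, burst=1000.0, max_keys=100_000):
        self.rate = rate          # steady-state refill: tokens (requests) per second
        self.burst = burst        # bucket capacity: largest burst a key may send
        self.max_keys = max_keys  # memory bound; coldest buckets are evicted first
        self._buckets = OrderedDict()  # key -> (tokens, last_refill_time)

    def allow(self, key):
        now = time.monotonic()
        tokens, last = self._buckets.pop(key, (self.burst, now))
        # Refill credits accumulated while the key was idle, capped at burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[key] = (tokens, now)  # re-insert as most recently used
        if len(self._buckets) > self.max_keys:
            self._buckets.popitem(last=False)  # evict the least recently used key
        return allowed
```

A caller checks allow(key) before doing work, for example with a key like "user:{user_id}", and responds with HTTP 429 (Too Many Requests) when it returns False, so only the offending key is throttled.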
Queue-based fairness and weighted fair queuing prevent head-of-line blocking, where one hot flow starves others. In systems like Amazon SQS First In First Out (FIFO) queues, a single hot message group is processed serially and can block throughput for other groups. Mitigations include using multiple message groups per logical entity (sharded groups) when strict total order is not required, or implementing per-key queues with weighted scheduling at the application layer so hot keys get their fair share but cannot consume all processing threads.

Load shedding with circuit breakers is the ultimate fallback: when a partition or service reaches saturation despite all mitigations, selectively drop requests with backpressure signals (HTTP 503 Service Unavailable, gRPC UNAVAILABLE status) to prevent queue buildup and cascading timeouts. Graceful degradation strategies include serving stale cached data for non-critical reads, returning partial results, or rejecting the lowest-priority traffic (for example, analytics or background jobs) before impacting user-facing requests. Monitoring must track shed rate by reason and by key to identify whether load shedding is protecting the system or indicating a capacity planning failure.
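As a concrete illustration of shedding by priority class, here is a sketch under assumed names, priority labels, and thresholds, not a specific framework's API:

```python
import threading


class PriorityLoadShedder:
    """Illustrative shedder: refuse low-priority work first as concurrency nears saturation."""

    # Assumed thresholds: fraction of capacity at which each class starts being shed.
    SHED_AT = {"background": 0.60, "analytics": 0.75, "user_facing": 0.95}

    def __init__(self, max_inflight=1000):
        self.max_inflight = max_inflight
        self.inflight = 0
        self.lock = threading.Lock()

    def try_admit(self, priority):
        """Return False when the request should be shed (caller responds 503 / UNAVAILABLE)."""
        with self.lock:
            utilization = self.inflight / self.max_inflight
            if utilization >= self.SHED_AT.get(priority, 0.95):
                return False
            self.inflight += 1
            return True

    def release(self):
        """Call when an admitted request finishes, successfully or not."""
        with self.lock:
            self.inflight -= 1
```

A request refused by try_admit would be answered with HTTP 503 (ideally with a Retry-After hint) or gRPC UNAVAILABLE, and each refusal should be counted by priority and key so the shed-rate monitoring above can distinguish healthy protection from a capacity shortfall.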
💡 Key Takeaways
• Per-key token bucket rate limits cap individual keys at maximums like 500 RPS per user or 1,000 RPS per object, bounding worst-case load and allowing short bursts via accumulated credits while draining quickly under sustained abuse
• Trade-off of strict per-key limits: aggregate capacity may sit idle if one user is throttled at 500 RPS while the backend can sustain 10,000 RPS total; adaptive burst borrowing allows exceeding the base quota when aggregate load is below capacity
• Queue-based fairness and weighted fair queuing prevent head-of-line blocking; in SQS FIFO, a single hot message group processes serially and blocks others, requiring sharded groups or per-key queues with weighted scheduling
• Load shedding with circuit breakers selectively drops requests (HTTP 503, gRPC UNAVAILABLE) when saturation occurs despite mitigations, preventing queue buildup and cascading timeouts across dependent services
• Graceful degradation strategies: serve stale cached data for non-critical reads, return partial results, or reject the lowest-priority traffic (analytics, background jobs) before impacting user-facing requests
📌 Examples
The Reddit API enforces 600 requests per 10 minutes per OAuth client (1 RPS on average); the token bucket allows bursts of up to 10 RPS for 60 seconds before throttling; bots that hit this limit receive HTTP 429 while other users are unaffected
An Amazon SQS FIFO queue with a single message group per order ID processes serially at roughly 300 messages/s; migrating to 10 sharded message groups per order (order_id:0 through order_id:9) reaches roughly 3,000 messages/s by parallelizing across groups
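A sketch of that sharding with boto3; the queue URL is a placeholder, and the choice of item_id as the sub-key that still needs relative ordering (plus the event_id deduplication field) is an assumption about the event schema:

```python
import hashlib
import json

import boto3  # assumes AWS credentials are configured and the FIFO queue exists

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"  # placeholder
NUM_SHARDS = 10


def send_order_event(order_id, event):
    # Shard the logical group "order_id" into order_id:0 .. order_id:9. The shard is
    # derived from a sub-key (assumed here: event["item_id"]) so events that must stay
    # mutually ordered land in the same shard, while unrelated events for one hot order
    # spread across 10 message groups and are processed in parallel.
    shard = int(hashlib.sha256(event["item_id"].encode()).hexdigest(), 16) % NUM_SHARDS
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(event),
        MessageGroupId=f"{order_id}:{shard}",
        MessageDeduplicationId=event["event_id"],  # assumed unique per event
    )
```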
A content API implements circuit breakers per endpoint; when celebrity profile requests saturate the backend at 50,000 QPS, the circuit opens and serves a stale 10-second cached profile to 90% of requests while 10% fetch fresh data, keeping the backend under 5,000 QPS
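A simplified sketch of that behavior; the breaker logic, thresholds, and the cache/backend interfaces are assumptions for illustration rather than a specific library's API:

```python
import random
import time


class StaleServingBreaker:
    """Illustrative per-endpoint breaker: when open, most callers get a stale cached
    value and only a sampled fraction reaches the backend to refresh the cache."""

    def __init__(self, backend_qps_budget=5000.0, fresh_fraction=0.10, window_s=1.0):
        self.budget = backend_qps_budget      # backend load we are willing to pass through
        self.fresh_fraction = fresh_fraction  # share of traffic refreshed while open
        self.window_s = window_s
        self.window_start = time.monotonic()
        self.offered = 0                      # requests seen in the current window
        self.open = False

    def _tick(self):
        now = time.monotonic()
        if now - self.window_start >= self.window_s:
            # Open when the previous window's offered load exceeded the backend budget.
            self.open = self.offered > self.budget * self.window_s
            self.window_start, self.offered = now, 0

    def get_profile(self, user_id, cache, backend):
        """cache and backend are assumed interfaces: cache.get/set and backend.fetch."""
        self._tick()
        self.offered += 1
        cached = cache.get(user_id)
        if self.open and cached is not None and random.random() > self.fresh_fraction:
            return cached                      # serve stale data during the spike
        fresh = backend.fetch(user_id)         # ~10% of traffic while the breaker is open
        cache.set(user_id, fresh, ttl=10)      # 10-second TTL, matching the example
        return fresh
```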