
Trade-offs: Accuracy vs Performance and Consistency vs Availability

The accuracy versus performance trade-off centers on computational cost per decision. Fixed windows require one atomic increment and are cheapest, but overshoot at window boundaries. Approximated sliding windows need two counter reads and a weighted-sum calculation, improving accuracy but doubling storage operations (see the sliding-window sketch below). Token bucket provides smooth burst control but requires an atomic read-modify-write with time-based refill logic, demanding stronger consistency guarantees to prevent double spending, where concurrent requests both consume the same token.

Consistency versus availability presents a harder choice. Strict global consistency, using a single leader or strongly consistent operations, prevents any limit overruns but increases tail latency and creates sensitivity to network partitions. When the shared store is unavailable, rate limiting must either fail closed (reject all requests to protect the backend, risking false denials) or fail open (allow all requests to preserve availability, risking backend overload). Most production systems choose fail open, with circuit breakers and backend-side protection as a safety net.

Centralized versus decentralized enforcement trades global visibility for latency. A centralized store in one region ensures all servers see the same state but adds cross-region round-trip time (RTT) for geographically distributed systems. Google and Amazon enforce rate limits at edge locations close to users, then synchronize state centrally. Decentralized approaches, using local approximations with periodic reconciliation, reduce per-request latency but tolerate temporary discrepancies, which is acceptable when limits are soft guardrails rather than hard security boundaries.

The choice depends on your risk profile. Use distributed rate limiting with strong consistency when you must enforce tenant fairness in multi-tenant systems or protect shared backends with strict quotas. Prefer local or load-balancer-based limiting when you primarily want overload protection, coarse per-node fairness suffices, and ultra-low latency is paramount.
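To make the weighted-sum calculation concrete, here is a minimal sketch of the approximated sliding window. The in-memory `counters` dict and the `allow` function are illustrative assumptions, standing in for the two counter reads that would normally go to a shared store such as Redis.

```python
import time

# Illustrative in-memory counters keyed by (key, window index); in production
# these would be two reads against a shared store such as Redis.
counters = {}

def allow(key: str, limit: int, window_secs: float = 60.0) -> bool:
    now = time.time()
    curr_window = int(now // window_secs)
    prev_window = curr_window - 1

    curr = counters.get((key, curr_window), 0)
    prev = counters.get((key, prev_window), 0)

    # Weight the previous window's count by how much of it still overlaps
    # the trailing window_secs interval ending now.
    elapsed_fraction = (now % window_secs) / window_secs
    estimated = prev * (1.0 - elapsed_fraction) + curr

    if estimated >= limit:
        return False
    counters[(key, curr_window)] = curr + 1
    return True
```

The estimate assumes requests in the previous window arrived uniformly, which is the source of the small residual error quantified in the takeaways below.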
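The token bucket's double-spend hazard is easiest to see in code. This is a minimal single-process sketch, assuming a local lock stands in for the atomicity a shared store would provide; the class and field names are illustrative.

```python
import threading
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # max tokens the bucket holds
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self._lock = threading.Lock()     # stands in for store-side atomicity

    def try_consume(self, n: float = 1.0) -> bool:
        # The read-modify-write below must be atomic: without the lock, two
        # threads can both read tokens == 1, both pass the check, and both
        # decrement, letting two requests through on one token.
        with self._lock:
            now = time.monotonic()
            self.tokens = min(
                self.capacity,
                self.tokens + (now - self.last_refill) * self.refill_rate,
            )
            self.last_refill = now
            if self.tokens >= n:
                self.tokens -= n
                return True
            return False
```

Against a shared store, the same atomicity is typically obtained by running the whole refill-and-consume step server-side, for example as a single Redis Lua script, which Redis executes atomically.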
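A fail-open policy can be expressed as a thin wrapper around the store call. In this sketch, `check_remote_limit` is a hypothetical stand-in for the shared-store check, written here to simulate an outage so the fallback path is visible.

```python
def check_remote_limit(key: str, limit: int) -> bool:
    # Hypothetical stand-in for a call against the shared store
    # (e.g., a Redis-backed counter). Here it simulates an outage.
    raise ConnectionError("rate-limit store unreachable")

def allow_request(key: str, limit: int) -> bool:
    try:
        return check_remote_limit(key, limit)
    except (TimeoutError, ConnectionError):
        # Fail open: preserve availability during store outages and rely on
        # backend-side protection (circuit breakers, load shedding) as the
        # secondary defense. Return False here instead to fail closed,
        # trading false denials for backend safety.
        return True

print(allow_request("tenant-42", 100))  # True: the outage path fails open
```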
💡 Key Takeaways
Token bucket requires an atomic read-modify-write to prevent double spend: without atomicity, two concurrent requests can both read 1 remaining token and both decrement it, allowing 2 requests through
Fail-open strategies preserve availability during store outages but risk backend overload; fail-closed protects backends but creates false denials. Most systems fail open, with backend circuit breakers as a secondary defense
Centralizing enforcement in one region adds round-trip-time overhead (typically 50 to 200 milliseconds across continents), making per-region budgets with periodic reconciliation more practical for global systems
Approximated sliding windows double storage operations compared to fixed windows (reading two counters instead of one) but reduce boundary burst error from 100% overshoot to under 1% when properly tuned
📌 Examples
Amazon API Gateway enforces rate limits at edge locations (fail open on store unavailability) with backend throttling as secondary protection. This tolerates brief overruns during outages while preventing cascading failures to origin servers.
A global API with a 100 requests per minute limit can allocate 40 to a US region, 40 to a Europe region, and 20 to an Asia region as independent budgets. Worst-case global throughput is capped at 100 instead of the 300 possible if each region independently enforced the full limit, while avoiding cross-region synchronization latency on every request.
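A minimal sketch of how such a split might be derived; the proportional allocation and the region weights are illustrative assumptions, not a prescribed algorithm.

```python
def split_budget(global_limit: int, traffic_share: dict) -> dict:
    # Allocate independent per-region budgets proportional to observed
    # traffic share. Because the budgets sum to at most the global limit,
    # worst-case global throughput stays at global_limit even with no
    # cross-region synchronization on the request path.
    return {region: int(global_limit * share)
            for region, share in traffic_share.items()}

print(split_budget(100, {"us": 0.4, "eu": 0.4, "asia": 0.2}))
# {'us': 40, 'eu': 40, 'asia': 20}
```

Periodic reconciliation can then shift unused budget between regions, trading brief discrepancies for low per-request latency.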