Rate Limiting • Token Bucket AlgorithmMedium⏱️ ~3 min
Hierarchical Token Buckets and Multi-Scope Fairness
Single scope token buckets are simple but permit noisy neighbor problems where one tenant or user exhausts shared capacity. Hierarchical buckets compose multiple independent buckets at different scopes (per user, per tenant, global) to enforce fairness and guardrails simultaneously. A request must pass all relevant bucket checks to be admitted.
For example, consider a multi-tenant API with three levels: per user bucket (r = 100/sec, b = 50), per tenant bucket (r = 1,000/sec, b = 200), and global bucket (r = 10,000/sec, b = 1,000). A request from user Alice in tenant Acme checks Alice's bucket first, then Acme's tenant bucket, then the global bucket. If any bucket lacks tokens, the request is denied. This prevents Alice from monopolizing Acme's quota, prevents Acme from starving other tenants, and protects the overall system from aggregate overload.
The cost is multiple checks per request, increasing tail latency. To mitigate, perform checks in parallel and short circuit on first rejection. For remote or global scopes, consider soft admission with post factum accounting when risk is low. Envoy implements this by chaining local rate limiting filters: evaluate per Internet Protocol (IP) bucket and per API key bucket sequentially, adding roughly 100 microseconds total per request with both checks.
Amazon API Gateway uses hierarchical enforcement with account level, per API, and per stage throttles. Each scope has independent rate and burst parameters. A single API within an account cannot exceed its per API limit even if the account level limit has headroom, ensuring isolation across services. Typical configurations set account level r = 10,000/sec, per API r = 5,000/sec, and per stage r = 1,000/sec, progressively narrowing scope to balance flexibility and protection.
💡 Key Takeaways
•Hierarchical buckets prevent noisy neighbor by enforcing limits at multiple scopes. A per user bucket (r = 100/sec) stops one user from monopolizing tenant quota, while per tenant bucket (r = 1,000/sec) prevents tenant from starving others.
•Admission requires passing all relevant bucket checks. If any scope (user, tenant, global) lacks tokens, reject immediately. Perform checks in parallel with short circuit logic to minimize tail latency impact.
•Multiple checks add latency proportionally: two local checks add roughly 200 to 400 nanoseconds, acceptable for most workloads. Remote checks add milliseconds; cache results or use soft admission with async reconciliation when precision is less critical.
•Amazon API Gateway implements account, per API, and per stage hierarchies with independent rate and burst. Typical settings: account r = 10,000/sec, per API r = 5,000/sec, per stage r = 1,000/sec, progressively narrowing to ensure isolation.
•State cardinality explodes multiplicatively: with 1,000 tenants and 10,000 users per tenant, per user buckets require 10 million entries. Use lazy initialization, Time-To-Live (TTL) expiration, and bloom filters to track active sets efficiently.
📌 Examples
Envoy chaining per Internet Protocol (IP) bucket (r = 1,000/sec, b = 200) and per API key bucket (r = 500/sec, b = 100) protects against both volumetric Distributed Denial of Service (DDoS) and credential abuse. Total decision overhead under 100 microseconds for both checks.
Multi-tenant Software as a Service (SaaS) platform sets per user r = 50/sec, per tenant r = 5,000/sec, global r = 100,000/sec. During Black Friday spike, tenant acme hits tenant limit at 5,000/sec while other tenants continue unaffected, proving isolation.
AWS Lambda throttling uses function level and account level buckets. A single runaway function hitting its concurrency limit (e.g., 1,000 concurrent executions) does not impact other functions in the account, which have independent allocations.