Rate Limiting • Token Bucket Algorithm · Medium · ⏱️ ~3 min
Production Deployment: Local vs Distributed Token Buckets
Deploying token buckets in production requires choosing between local (per-instance) and distributed (globally coordinated) enforcement, a decision that trades enforcement accuracy against decision latency and availability.
Local token buckets keep all state in process memory with atomic counters. Each service instance independently enforces its own bucket, typically allocated global_rate divided by N instances. For example, with a global limit of 8,000 RPS across 4 frontends, each instance gets r = 2,000 tokens/sec and b = 500. Decision latency is under 1 microsecond, and there is no coordination dependency or Single Point of Failure (SPOF). The tradeoff is overshoot during traffic imbalance or failover: if one instance fails, the remaining instances continue at their configured rates, potentially exceeding the global target by their local burst allocations. Worst-case overshoot equals the sum of all local b values during partition windows.
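A minimal sketch of the local variant in Go, assuming the per-instance allocation above (r = 2,000 tokens/sec, b = 500); the type and function names are illustrative, not taken from any particular library:

```go
package ratelimit

import (
	"math"
	"sync"
	"time"
)

// LocalBucket is an in-process token bucket: all state lives in memory,
// so Allow() involves no network round trip.
type LocalBucket struct {
	mu         sync.Mutex
	rate       float64 // tokens added per second (r)
	burst      float64 // maximum bucket size (b)
	tokens     float64 // current fill level
	lastRefill time.Time
}

func NewLocalBucket(rate, burst float64) *LocalBucket {
	return &LocalBucket{rate: rate, burst: burst, tokens: burst, lastRefill: time.Now()}
}

// Allow refills lazily based on elapsed time, then spends one token if available.
func (tb *LocalBucket) Allow() bool {
	tb.mu.Lock()
	defer tb.mu.Unlock()

	now := time.Now()
	elapsed := now.Sub(tb.lastRefill).Seconds()
	tb.tokens = math.Min(tb.burst, tb.tokens+elapsed*tb.rate)
	tb.lastRefill = now

	if tb.tokens >= 1 {
		tb.tokens--
		return true
	}
	return false
}
```

With the numbers above, each of the 4 frontends would construct NewLocalBucket(2000, 500); the decision path is one mutex acquisition and a few float operations, which is where the sub-microsecond figure comes from.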
Distributed token buckets maintain per-key state in a strongly consistent store (Redis, etcd, or similar) with atomic read-modify-write operations. This provides exact global enforcement but adds 0.3 to 5 milliseconds of network Round-Trip Time (RTT) per check within a datacenter. At 50,000 requests per second, a single coordination node saturates, requiring key-based sharding and colocation with the request path. Many systems adopt a hybrid: cache token "leases" locally (e.g., grab 200 tokens from the global bucket, enforce locally until exhausted) to amortize coordination cost while bounding overshoot.
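A sketch of that hybrid lease pattern under the same assumptions; GlobalStore and TakeTokens are hypothetical stand-ins for whatever atomic operation the consistent store exposes (e.g., a Redis Lua script):

```go
package ratelimit

import "sync"

// GlobalStore stands in for the strongly consistent backend (Redis, etcd).
// TakeTokens is assumed to atomically deduct up to n tokens from the global
// bucket for key and return how many were actually granted.
type GlobalStore interface {
	TakeTokens(key string, n int) (granted int, err error)
}

// LeasedBucket amortizes coordination cost: it fetches a lease of tokens from
// the global store, then answers Allow() locally until the lease is exhausted.
type LeasedBucket struct {
	mu        sync.Mutex
	store     GlobalStore
	key       string
	leaseSize int // e.g., 200 tokens per round trip
	remaining int
}

func NewLeasedBucket(store GlobalStore, key string, leaseSize int) *LeasedBucket {
	return &LeasedBucket{store: store, key: key, leaseSize: leaseSize}
}

func (lb *LeasedBucket) Allow() (bool, error) {
	lb.mu.Lock()
	defer lb.mu.Unlock()

	if lb.remaining == 0 {
		// Only this path pays the 0.3-5 ms RTT; the other leaseSize-1
		// decisions are made from local state.
		granted, err := lb.store.TakeTokens(lb.key, lb.leaseSize)
		if err != nil {
			return false, err
		}
		lb.remaining = granted
	}
	if lb.remaining > 0 {
		lb.remaining--
		return true, nil
	}
	return false, nil
}
```

With a 200-token lease, only about 1 in 200 requests touches the global store, and worst-case overshoot stays bounded by lease_size × instance_count, as noted in the takeaways below.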
Stripe and Salesforce integrations illustrate both patterns in practice. Stripe documents roughly 100 RPS steady with a small burst allowance per account, enforced globally to prevent one datacenter from consuming another's quota. Integrators report that staying at 80 to 100 RPS with jittered retries keeps 429-induced tail latency under 50 milliseconds. Salesforce customers deploy local token bucket frontends (r = 200 RPS, b = 1,000) to shape spikes before hitting Salesforce concurrency caps, stabilizing workflow latency in the 200 to 500 millisecond range under load.
💡 Key Takeaways
• Local buckets provide sub-microsecond decisions with zero coordination but can overshoot global limits by the sum of local burst allocations during failover. With 4 instances each having b = 500, worst-case overshoot is 2,000 tokens during partition windows.
• Distributed buckets with strongly consistent stores add 0.3 to 5 milliseconds of latency per check and create coordination Single Points of Failure (SPOFs). At 50,000 RPS, a single Redis node saturates, requiring key-based sharding and careful capacity planning.
• The hybrid lease pattern amortizes coordination cost: grab N tokens from the global bucket, enforce locally until exhausted. Grabbing 200-token leases reduces global-store queries per second (QPS) by 200× while bounding overshoot to lease_size × instance_count.
• AWS API Gateway uses local enforcement per edge location with periodic reconciliation, achieving negligible latency overhead while accepting slight overshoot during traffic shifts. Rejection paths complete in single-digit milliseconds end to end.
• Memory scales linearly with identity cardinality. With 10 million active keys at 64 bytes per key, you need roughly 640 megabytes just for counters. Use Time-To-Live (TTL) and active-set tracking to bound growth (see the sketch after this list), but beware cold-key reinitialization bursts.
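A sketch of bounding that growth with a TTL sweep, reusing the LocalBucket type from the earlier sketch; the eviction policy and names are illustrative:

```go
package ratelimit

import (
	"sync"
	"time"
)

// BucketTable bounds memory for high-cardinality keys: buckets that have not
// been touched within ttl are removed by a periodic sweep.
type BucketTable struct {
	mu      sync.Mutex
	ttl     time.Duration
	buckets map[string]*entry
}

type entry struct {
	bucket   *LocalBucket
	lastSeen time.Time
}

func NewBucketTable(ttl time.Duration) *BucketTable {
	return &BucketTable{ttl: ttl, buckets: make(map[string]*entry)}
}

// Get returns the bucket for key, creating it on first use. A just-evicted
// key comes back with a full bucket, which is the cold-key reinitialization
// burst mentioned above.
func (t *BucketTable) Get(key string, rate, burst float64) *LocalBucket {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.buckets[key]
	if !ok {
		e = &entry{bucket: NewLocalBucket(rate, burst)}
		t.buckets[key] = e
	}
	e.lastSeen = time.Now()
	return e.bucket
}

// Sweep deletes idle keys; run it on a timer so memory tracks the active set
// rather than total identity cardinality.
func (t *BucketTable) Sweep() {
	t.mu.Lock()
	defer t.mu.Unlock()
	cutoff := time.Now().Add(-t.ttl)
	for k, e := range t.buckets {
		if e.lastSeen.Before(cutoff) {
			delete(t.buckets, k)
		}
	}
}
```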
📌 Examples
Kubernetes client-go allocates local buckets per client with QPS = 5, burst = 10. With 50 controllers per node, the aggregate request rate smooths naturally without centralized state, preventing API server overload while maintaining high availability.
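For reference, a minimal sketch of wiring those numbers into a client via client-go's rest.Config; QPS and Burst are the standard config fields, while the in-cluster setup is an assumption made for brevity:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes in-cluster; use clientcmd otherwise
	if err != nil {
		panic(err)
	}
	// client-go's default client-side limiter is a local token bucket;
	// QPS and Burst correspond to r and b for this single client instance.
	cfg.QPS = 5
	cfg.Burst = 10
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("clientset ready: %T\n", clientset)
}
```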
Stripe enforces roughly 100 RPS per account globally using distributed coordination. Merchants staying at 80 to 100 RPS with jittered retries avoid throttles, with 429-induced tail latency under 50 milliseconds in most cases.
Salesforce integrators deploy local token buckets (r = 200 RPS, b = 1,000) at the edge to shape spikes into Salesforce concurrency caps. This prevents downstream 5xx bursts and stabilizes average workflow latency at 200 to 500 milliseconds under load.