Rate Limiting • Token Bucket Algorithm
Sizing and Tuning: Choosing r and b with Real Numbers
Choosing refill rate r and burst capacity b requires understanding downstream tolerance and user experience goals. Start from the downstream tier's sustained capacity and short-term burst headroom, then work backward.
Suppose a database tier comfortably sustains 8,000 queries per second (QPS) and can tolerate 1-second micro-bursts of 2,000 extra requests without queueing or tail latency degradation. Set r = 8,000 tokens/sec and b = 2,000. Starting from a full bucket, any T-second interval admits at most r·T + b = 8,000·T + 2,000 requests. A client idle for 10 seconds can immediately send 2,000 requests as a burst, then sustain 8,000 per second. If you run 4 frontend instances, allocate each r = 2,000 tokens/sec and b = 500 to distribute the global budget, or use a distributed bucket with small local leases (e.g., 200 tokens per instance) to reduce coordination overhead.
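A lazy-refill token bucket implementing the numbers above can be sketched as follows. The `TokenBucket` class and its fields are illustrative names for this sketch, not from any particular library:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at r tokens/sec up to capacity b."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate            # r: sustained refill rate, tokens/sec
        self.burst = burst          # b: maximum bucket depth
        self.tokens = burst         # start full so an idle client can burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Lazy refill: credit tokens for elapsed time, clamped to b.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Global budget from the example: r = 8,000 tokens/sec, b = 2,000,
# split evenly across 4 frontends: each gets r = 2,000, b = 500.
per_instance = TokenBucket(rate=2_000, burst=500)
```

Starting the bucket full is what makes the r·T + b bound hold from the first request: a fresh or long-idle client can spend b tokens immediately, then is paced at r.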
Weighted tokens handle heterogeneous request costs. If a heavy analytical query consumes 5× the resources of a light lookup, assign w = 5 tokens to heavy requests and w = 1 to light ones. With r = 1,000 tokens/sec and b = 1,000, two heavy requests consume 10 tokens. Sustained heavy-only traffic is limited to 200 requests/sec (1,000 tokens / 5 tokens per request), while light-only traffic achieves 1,000 requests/sec. This naturally prioritizes cheap operations and throttles expensive ones proportionally.
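The weighted-charge arithmetic can be sketched directly; the `admit` helper is a hypothetical name introduced for this illustration:

```python
def admit(tokens: float, weight: float) -> tuple[bool, float]:
    """Charge `weight` tokens if available; return (admitted, remaining)."""
    if tokens >= weight:
        return True, tokens - weight
    return False, tokens

# Values from the example: r = 1,000 tokens/sec, b = 1,000,
# heavy requests cost w = 5, light requests cost w = 1.
r, w_heavy, w_light = 1_000, 5, 1

# Two heavy requests against a full bucket consume 10 tokens.
ok1, remaining = admit(1_000, w_heavy)
ok2, remaining = admit(remaining, w_heavy)   # remaining is now 990

# Sustained throughput for a single class is r / w requests/sec:
max_heavy_rps = r / w_heavy   # 200 heavy requests/sec
max_light_rps = r / w_light   # 1,000 light requests/sec
```

The division r / w is the whole mechanism: weights never change r, they just make expensive requests drain the shared budget faster.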
AWS API Gateway defaults to r = 10,000 requests/sec per account per Region with a burst b of 5,000 requests, i.e., about 0.5 seconds of headroom at r. Stripe enforces roughly r = 100 requests/sec per account with b sized for sub-second bursts. Production telemetry shows that sizing b between 0.5 and 1.0 seconds of r balances user experience (allowing natural request clustering) against downstream protection (preventing resource exhaustion). Going beyond 2 seconds of r for b increases the risk of connection pool saturation and retry storms with minimal user benefit.
💡 Key Takeaways
•Start with downstream sustained capacity for r and burst headroom for b. If the database handles 8,000 QPS sustained with 2,000 extra for 1-second bursts, set r = 8,000 tokens/sec and b = 2,000 to match exactly.
•Size b between 0.5 and 1.0 seconds of r as the default. This balances user experience (natural request clustering) and protection (preventing connection pool exhaustion). AWS API Gateway defaults to b = 5,000 against r = 10,000, i.e., 0.5 seconds' worth.
•Weighted tokens enable capacity aware limiting. Assign w = 5 tokens to heavy queries and w = 1 to light lookups. With r = 1,000 tokens/sec, sustained heavy traffic achieves 200 requests/sec while light traffic reaches 1,000 requests/sec.
•For multi-instance deployments, either divide the global budget by instance count or use distributed enforcement with local leases. With 4 frontends and global r = 8,000, allocate each r = 2,000 locally, or use 200-token leases so each instance contacts the central store once per 200 grants, cutting coordination traffic roughly 200×.
•Monitor deny rate and token levels continuously. Sustained deny rate above 1 to 5% or tokens persistently near zero indicates demand exceeds capacity. Alert and provision more capacity or tighten client side batching to reduce arrival rate.
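The monitoring guidance in the last takeaway can be sketched as a sliding-window deny-rate tracker. The class name, window, and alert threshold below are illustrative choices, not a standard API:

```python
import time
from collections import deque

class DenyRateMonitor:
    """Track rate-limiter denies over a sliding window and flag sustained
    deny rates above a threshold (e.g., the 1-5% band from the takeaways)."""
    def __init__(self, window_sec: float = 60.0, alert_threshold: float = 0.05):
        self.window = window_sec
        self.threshold = alert_threshold
        self.events: deque[tuple[float, bool]] = deque()  # (timestamp, denied)

    def record(self, denied: bool, now: float | None = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, denied))
        # Drop events older than the window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()

    def deny_rate(self) -> float:
        if not self.events:
            return 0.0
        return sum(denied for _, denied in self.events) / len(self.events)

    def should_alert(self) -> bool:
        return self.deny_rate() > self.threshold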
📌 Examples
Database connection pool with 100 connections, each handling 80 QPS, provides 8,000 QPS sustained. Set r = 8,000 tokens/sec. The pool can absorb a 25% overshoot (2,000 extra requests) for 1 second without queueing, so set b = 2,000.
API serving a mixed workload: 80% light reads (50 ms latency, w = 1 token) and 20% heavy writes (500 ms, w = 10 tokens). Set r = 1,000 tokens/sec. Effective capacity is 1,000 requests/sec if traffic is all light reads, or 100 requests/sec if all heavy writes.
Stripe per account limit of roughly 100 requests/sec with sub-second burst. Merchants smoothing to 80 to 100 requests/sec with jittered retry see 429 rate under 1%, with retry induced p99 latency under 50 milliseconds.
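The jittered retry behavior the Stripe example relies on can be sketched as "full jitter" exponential backoff; the function name and the base/cap parameters are illustrative assumptions:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Full-jitter exponential backoff: sleep a uniform random duration in
    [0, min(cap, base * 2**attempt)]. Randomizing the delay spreads retries
    out, avoiding the synchronized retry storms the section warns about."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

A client that sleeps `backoff_with_jitter(attempt)` after each 429 response smooths its arrival rate toward the bucket's refill rate instead of hammering the limiter in lockstep with every other throttled client.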