Rate Limiting • Token Bucket Algorithm
Sizing and Tuning: Choosing r and b with Real Numbers
Choosing refill rate r and burst capacity b requires understanding downstream tolerance and user experience goals. Start from the downstream tier's sustained capacity and short-term burst headroom, then work backward.
Suppose a database tier comfortably sustains 8,000 queries per second (QPS) and can tolerate 1-second micro-bursts of 2,000 extra requests without queueing or tail latency degradation. Set r = 8,000 tokens/sec and b = 2,000. Starting from a full bucket, any T-second interval admits at most r·T + b = 8,000·T + 2,000 requests. A client idle for 10 seconds can immediately send 2,000 requests as a burst, then sustain 8,000 per second. If you run 4 frontend instances, allocate each r = 2,000 tokens/sec and b = 500 to distribute the global budget, or use a distributed bucket with small local leases (e.g., 200 tokens per instance) to reduce coordination overhead.
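A lazy-refill token bucket implementing the numbers above can be sketched as follows. The `TokenBucket` class and its fields are illustrative names for this sketch, not from any particular library:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at r tokens/sec up to capacity b."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate            # r: sustained refill rate, tokens/sec
        self.burst = burst          # b: maximum bucket depth
        self.tokens = burst         # start full so an idle client can burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Lazy refill: credit tokens for elapsed time, clamped to b.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Global budget from the example: r = 8,000 tokens/sec, b = 2,000,
# split evenly across 4 frontends: each gets r = 2,000, b = 500.
per_instance = TokenBucket(rate=2_000, burst=500)
```

Starting the bucket full is what makes the r·T + b bound hold from the first request: a fresh or long-idle client can spend b tokens immediately, then is paced at r.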
Weighted tokens handle heterogeneous request costs. If a heavy analytical query consumes 5× the resources of a light lookup, assign w = 5 tokens to heavy requests and w = 1 to light ones. With r = 1,000 tokens/sec and b = 1,000, two heavy requests consume 10 tokens. Sustained heavy-only traffic is limited to 200 requests/sec (1,000 tokens / 5 tokens per request), while light-only traffic achieves 1,000 requests/sec. This naturally prioritizes cheap operations and throttles expensive ones proportionally.
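The weighted-charge arithmetic can be sketched directly; the `admit` helper is a hypothetical name introduced for this illustration:

```python
def admit(tokens: float, weight: float) -> tuple[bool, float]:
    """Charge `weight` tokens if available; return (admitted, remaining)."""
    if tokens >= weight:
        return True, tokens - weight
    return False, tokens

# Values from the example: r = 1,000 tokens/sec, b = 1,000,
# heavy requests cost w = 5, light requests cost w = 1.
r, w_heavy, w_light = 1_000, 5, 1

# Two heavy requests against a full bucket consume 10 tokens.
ok1, remaining = admit(1_000, w_heavy)
ok2, remaining = admit(remaining, w_heavy)   # remaining is now 990

# Sustained throughput for a single class is r / w requests/sec:
max_heavy_rps = r / w_heavy   # 200 heavy requests/sec
max_light_rps = r / w_light   # 1,000 light requests/sec
```

The division r / w is the whole mechanism: weights never change r, they just make expensive requests drain the shared budget faster.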
AWS API Gateway defaults to r = 10,000 requests/sec per account per Region with a burst b of 5,000 requests, i.e., about 0.5 seconds of headroom at r. Stripe enforces roughly r = 100 requests/sec per account with b sized for sub-second bursts. Production telemetry shows that sizing b between 0.5 and 1.0 seconds of r balances user experience (allowing natural request clustering) against downstream protection (preventing resource exhaustion). Going beyond 2 seconds of r for b increases the risk of connection pool saturation and retry storms with minimal user benefit.
💡 Key Takeaways
•Start with downstream sustained capacity for r and burst headroom for b. If the database handles 8,000 QPS sustained with 2,000 extra for 1-second bursts, set r = 8,000 tokens/sec and b = 2,000 to match exactly.
•Size b between 0.5 and 1.0 seconds of r as the default. This balances user experience (natural request clustering) and protection (preventing connection pool exhaustion). AWS API Gateway defaults to b = 5,000 against r = 10,000, i.e., 0.5 seconds' worth.
•Weighted tokens enable capacity aware limiting. Assign w = 5 tokens to heavy queries and w = 1 to light lookups. With r = 1,000 tokens/sec, sustained heavy traffic achieves 200 requests/sec while light traffic reaches 1,000 requests/sec.
•For multi-instance deployments, either divide the global budget by instance count or use distributed enforcement with local leases. With 4 frontends and global r = 8,000, allocate each r = 2,000 locally, or use 200-token leases so each instance contacts the central store once per 200 grants, cutting coordination traffic roughly 200×.
•Monitor deny rate and token levels continuously. Sustained deny rate above 1 to 5% or tokens persistently near zero indicates demand exceeds capacity. Alert and provision more capacity or tighten client side batching to reduce arrival rate.
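The monitoring guidance in the last takeaway can be sketched as a sliding-window deny-rate tracker. The class name, window, and alert threshold below are illustrative choices, not a standard API:

```python
import time
from collections import deque

class DenyRateMonitor:
    """Track rate-limiter denies over a sliding window and flag sustained
    deny rates above a threshold (e.g., the 1-5% band from the takeaways)."""
    def __init__(self, window_sec: float = 60.0, alert_threshold: float = 0.05):
        self.window = window_sec
        self.threshold = alert_threshold
        self.events: deque[tuple[float, bool]] = deque()  # (timestamp, denied)

    def record(self, denied: bool, now: float | None = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, denied))
        # Drop events older than the window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()

    def deny_rate(self) -> float:
        if not self.events:
            return 0.0
        return sum(denied for _, denied in self.events) / len(self.events)

    def should_alert(self) -> bool:
        return self.deny_rate() > self.threshold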
📌 Examples
Database connection pool with 100 connections, each handling 80 QPS, provides 8,000 QPS sustained. Set r = 8,000 tokens/sec. The pool can absorb a 25% overshoot (2,000 extra requests) for 1 second without queueing, so set b = 2,000.
API serving a mixed workload: 80% light reads (50 ms latency, w = 1 token) and 20% heavy writes (500 ms, w = 10 tokens). Set r = 1,000 tokens/sec. Effective capacity is 1,000 requests/sec if traffic is all light reads, or 100 requests/sec if all heavy writes.
Stripe per account limit of roughly 100 requests/sec with sub-second burst. Merchants smoothing to 80 to 100 requests/sec with jittered retry see 429 rate under 1%, with retry induced p99 latency under 50 milliseconds.
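The jittered retry behavior the Stripe example relies on can be sketched as "full jitter" exponential backoff; the function name and the base/cap parameters are illustrative assumptions:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Full-jitter exponential backoff: sleep a uniform random duration in
    [0, min(cap, base * 2**attempt)]. Randomizing the delay spreads retries
    out, avoiding the synchronized retry storms the section warns about."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

A client that sleeps `backoff_with_jitter(attempt)` after each 429 response smooths its arrival rate toward the bucket's refill rate instead of hammering the limiter in lockstep with every other throttled client.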