
Sizing Parameters: Leak Rate r and Bucket Capacity B

The Two Parameters

Every Leaky Bucket has two parameters. The leak rate r determines how many requests leave per second (throughput). The bucket capacity B determines how many can queue before overflow (burst tolerance). Get r wrong and you overload downstream or waste capacity. Get B wrong and you reject too many requests or create unacceptable latency.
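The interplay of the two parameters can be sketched as a minimal limiter. This is an illustrative implementation, not any particular library's API; the class and field names are assumptions that simply mirror r and B from the text:

```python
import time

class LeakyBucket:
    """Minimal leaky-bucket sketch: a queue drained at r requests/sec,
    holding at most B queued requests before overflow."""

    def __init__(self, r: float, B: int):
        self.r = r            # leak rate (requests/sec) -> throughput
        self.B = B            # bucket capacity -> burst tolerance
        self.level = 0.0      # current queue depth
        self.last = time.monotonic()

    def try_add(self) -> bool:
        """Admit one request, or reject if the bucket would overflow."""
        now = time.monotonic()
        # Leak out whatever drained since the last check.
        self.level = max(0.0, self.level - (now - self.last) * self.r)
        self.last = now
        if self.level + 1 > self.B:
            return False      # overflow: reject
        self.level += 1
        return True
```

With a slow leak (r = 1/sec) and B = 2, two back-to-back requests are admitted and the third is rejected until the bucket drains.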

Sizing the Leak Rate (r)

The leak rate r should match what downstream can sustainably handle. If PostgreSQL handles 5,000 queries/sec at healthy latency, size r against that number. If your ML model processes 10 inferences/sec, r cannot exceed 10. Do not set r based on what you wish downstream could handle. Measure actual sustainable throughput, then set r to roughly 80% of the measured figure for headroom (4,000 in the PostgreSQL example).
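The sizing rule is a one-liner; the function name and the default headroom factor below are illustrative, following the 80% rule of thumb from the text:

```python
def size_leak_rate(measured_capacity: float, headroom: float = 0.80) -> float:
    """Set r to a fraction of measured sustainable downstream throughput,
    leaving the remainder as headroom for variance and background load."""
    return measured_capacity * headroom

# Text example: PostgreSQL sustains 5,000 queries/sec at healthy latency.
r = size_leak_rate(5_000)   # 4000.0 requests/sec with 20% headroom
```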

Sizing the Bucket Capacity (B)

Bucket capacity B determines max queue depth. The formula max wait = B / r tells you worst-case latency. With B = 500 and r = 100/sec, a request at the back waits 5 seconds. For user-facing APIs, keep max wait under 1 second. For background jobs, 10 to 60 seconds might be acceptable. For ML inference, match B to batch processing time.
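The max wait formula can be applied in both directions: computing worst-case latency from a given B, or deriving the largest B that fits a latency budget. Both helpers below are illustrative names, not a real API:

```python
def max_wait_seconds(B: int, r: float) -> float:
    """Worst-case queueing delay: a request admitted at the back of a
    full bucket waits B / r seconds to leak out."""
    return B / r

def capacity_for_latency(r: float, wait_budget_s: float) -> int:
    """Largest B that keeps the worst-case wait within a latency budget."""
    return int(r * wait_budget_s)

# Text example: B = 500, r = 100/sec -> 5 s worst-case wait.
print(max_wait_seconds(500, 100))        # 5.0
# A 1-second user-facing budget at r = 100/sec allows B = 100.
print(capacity_for_latency(100, 1.0))    # 100
```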

The Burst Absorption Trade-off

Larger B absorbs bigger bursts but increases worst-case latency. If traffic spikes from 80/sec to 200/sec for 5 seconds with r = 100/sec, you need B = (200 - 100) × 5 = 500 to avoid rejection. But the worst-case wait is then 5 seconds. If users cannot tolerate this, a smaller B with some rejections might be better.
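The burst calculation above generalizes to: capacity must cover the excess arrival rate times the spike duration. A sketch, with an illustrative function name:

```python
def capacity_for_burst(spike_rate: float, r: float, duration_s: float) -> float:
    """B needed to absorb a spike above the leak rate without rejections:
    (arrivals above r) accumulate in the bucket for the spike's duration."""
    return max(0.0, (spike_rate - r) * duration_s)

# Text example: 200/sec spike for 5 s against r = 100/sec.
B = capacity_for_burst(200, 100, 5)   # 500.0
# The trade-off: that B implies a 500 / 100 = 5 s worst-case wait.
```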

Production Tuning Process

Start with r at 80% of downstream capacity and B at 1 second worth of r. Monitor rejection rate (under 1%), queue depth (rarely hit B), and downstream latency (stay healthy). If rejection is high during legitimate spikes, increase B. If queue is always near B, increase r or add capacity.
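The monitoring loop above can be condensed into a toy decision rule. The function name, the exact cutoffs (1% rejections, queue depth within 90% of B), and the return strings are all illustrative assumptions, not a production policy:

```python
def tuning_advice(rejection_rate: float, avg_queue_depth: float, B: int) -> str:
    """Toy tuning rule: queue persistently near B means the leak is the
    bottleneck; high rejections during legitimate spikes mean B is too small."""
    if avg_queue_depth > 0.9 * B:
        return "increase r or add downstream capacity"
    if rejection_rate > 0.01:
        return "increase B to absorb legitimate spikes"
    return "configuration looks healthy"
```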

💡 Key Takeaways
Set leak rate r to 70 to 90 percent of sustainable backend capacity to absorb variance; example: 1,000 QPS database capacity means r = 700 to 900 rps leaves headroom for spikes and background load
Bucket capacity B must satisfy B ≤ r × (minimum timeout) or requests timeout while queued; r=500 rps with 2 second timeout requires B ≤ 1,000
Maximum queueing delay is B/r; NGINX production configs use r=5 rps per IP with B=25 giving 5 second max delay to shed bot traffic while protecting origin p95 latency
Multi-tenant systems need per-principal buckets with weighted fair scheduling to prevent one tenant monopolizing the leak and starving others of their allocated throughput
Measure under realistic key distributions: hot keys serialize updates and reduce throughput; if each limiter node handles 200k checks per second, a 5 node shard sustains roughly 1M checks per second with redundancy
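The timeout constraint from the takeaways (B ≤ r × timeout) is easy to check mechanically; `validate_capacity` below is a hypothetical helper, using the numbers from the takeaways and tips:

```python
def validate_capacity(r: float, B: int, timeout_s: float) -> bool:
    """Check B <= r * timeout: any request queued deeper than r * timeout
    will time out before it can leak out, wasting capacity on doomed work."""
    return B <= r * timeout_s

# Safe: r = 500 rps with a 2 s timeout allows B up to 1,000.
print(validate_capacity(500, 1_000, 2.0))    # True
# Unsafe config from the tips: r = 200 rps, B = 2,000, 3 s timeout.
print(validate_capacity(200, 2_000, 3.0))    # False (B must be <= 600)
```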
📌 Interview Tips
1. Uber client library pacing 500 QPS means 2 millisecond spacing between requests; a 1,000-request in-flight buffer caps worst-case queueing at 2 seconds, preventing downstream tail latency spikes during retry storms
2. Unsafe configuration: r=200 rps, B=2,000, timeout=3 seconds gives a max delay of 10 seconds. Most queued requests fail, wasting capacity and amplifying retries. Correct sizing: B ≤ 600