When to Use Leaky Bucket: Trade-offs and Alternatives
When Leaky Bucket Wins
Choose leaky bucket when your downstream is fragile and cannot tolerate bursts at all. Databases, ML inference services, and search engines often have this property: they maintain complex internal state (locks, buffer pools, query plans) whose performance degrades sharply under sudden load. A 2x traffic spike lasting just one second sounds harmless, but for a database that spike can push p99 latency from 20ms to 500ms and take minutes to recover as lock contention cascades.
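A minimal sketch of this smoothing behavior, assuming an in-process limiter (the class and parameter names are illustrative, not from any library): requests queue up and drain at a fixed rate, so downstream never sees more than `rate` requests per second, no matter how bursty arrivals are.

```python
import time
from collections import deque


class LeakyBucket:
    """Illustrative leaky bucket: arrivals queue, drain at a fixed rate."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # drain rate, requests per second
        self.capacity = capacity    # max queued requests before rejecting
        self.queue = deque()
        self.last_leak = time.monotonic()

    def allow(self, request) -> bool:
        """Enqueue a request; False means the bucket overflowed."""
        self._leak()
        if len(self.queue) >= self.capacity:
            return False            # overflow: shed load instead of bursting
        self.queue.append(request)
        return True

    def _leak(self):
        """Drain the requests that have 'leaked out' since the last check."""
        now = time.monotonic()
        leaked = int((now - self.last_leak) * self.rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()  # in a real system: dispatch downstream
            self.last_leak = now
```

Whatever the arrival pattern, the database behind this limiter sees at most `rate` requests per second; excess load is either queued (bounded by `capacity`) or rejected.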
When Token Bucket is Better
User-facing APIs often need burst responsiveness. Loading a dashboard might trigger 50 API calls in 100ms. Token bucket lets this through instantly (if tokens are available), while leaky bucket queues the calls (at a drain rate of 100 requests per second, the last call waits 500ms) even if downstream could handle the burst. For services that scale horizontally and handle bursts gracefully, token bucket provides better user experience.
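A matching token-bucket sketch under the same assumptions (illustrative names, not a library API): tokens refill at `rate` per second up to `capacity`, and a burst passes instantly as long as tokens remain.

```python
import time


class TokenBucket:
    """Illustrative token bucket: refill continuously, spend per request."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # refill rate, tokens per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full so an initial burst passes
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available, refilling based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `capacity=50`, the dashboard's 50 calls all pass immediately; a leaky bucket at the same average rate would spread them out instead.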
Hybrid Patterns
Real systems often combine algorithms: token bucket at the edge (allowing bursts through the load balancers) with leaky bucket at the database tier (smoothing writes to PostgreSQL). Or use leaky bucket for writes, where steady flow matters, and token bucket for reads, where burst tolerance matters. The key insight: different parts of your system have different characteristics, so use the algorithm that matches each component.
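The read/write split might look like the following sketch, with both limiters inlined as closures and the rates chosen purely for illustration:

```python
import time
from collections import deque


def make_token_limiter(rate: float, capacity: float):
    """Bursty limiter for reads: tokens refill continuously."""
    state = {"tokens": capacity, "last": time.monotonic()}

    def allow() -> bool:
        now = time.monotonic()
        state["tokens"] = min(capacity,
                              state["tokens"] + (now - state["last"]) * rate)
        state["last"] = now
        if state["tokens"] >= 1:
            state["tokens"] -= 1
            return True
        return False
    return allow


def make_leaky_limiter(rate: float, capacity: int):
    """Smoothing limiter for writes: queued work drains at a fixed rate."""
    queue = deque()
    state = {"last": time.monotonic()}

    def allow() -> bool:
        now = time.monotonic()
        drained = int((now - state["last"]) * rate)
        if drained > 0:
            for _ in range(min(drained, len(queue))):
                queue.popleft()   # in a real system: dispatch to the database
            state["last"] = now
        if len(queue) >= capacity:
            return False          # overflow: reject rather than burst
        queue.append(now)
        return True
    return allow


# Illustrative split: reads may burst, writes are smoothed.
allow_read = make_token_limiter(rate=100, capacity=50)
allow_write = make_leaky_limiter(rate=1, capacity=10)


def handle(kind: str) -> bool:
    return allow_read() if kind == "read" else allow_write()
```

A read burst passes through immediately while a write burst is queued and drained steadily, matching the observation that the two paths have different characteristics.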
Decision Framework
Ask these questions. Can downstream handle bursts? If yes, token bucket; if no, leaky bucket. Is latency variance acceptable? If users tolerate 100ms to 2 seconds of variance, leaky bucket works; if they expect consistently sub-100ms responses, token bucket is better. Is traffic naturally bursty? If yes and downstream can handle it, do not artificially smooth it; if downstream cannot handle it, smoothing is necessary.
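The questions above collapse to a small decision procedure; this toy encoding is illustrative (the inputs and returned labels are assumptions, not an exhaustive policy):

```python
def choose_rate_limiter(downstream_handles_bursts: bool,
                        latency_variance_ok: bool) -> str:
    """Toy encoding of the decision framework above."""
    if downstream_handles_bursts:
        # No reason to artificially smooth traffic the backend tolerates.
        return "token bucket"
    if latency_variance_ok:
        # Fragile backend, and users tolerate queueing delay.
        return "leaky bucket"
    # Fragile backend but strict latency: smooth, and keep capacity
    # small so worst-case queueing delay stays bounded.
    return "leaky bucket with small capacity"
```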
Common Mistakes
Using leaky bucket for user-facing APIs when token bucket would work creates unnecessary latency. Using token bucket for fragile backends allows bursts that trigger cascading failures. Setting bucket capacity B too high creates unacceptable worst-case latency, since a request arriving to a full queue waits B divided by the drain rate. Not monitoring queue depth misses the early warning of a capacity mismatch. Ignoring head-of-line blocking lets one slow request delay many behind it. Always match the algorithm to the workload characteristics.
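The queue-depth check the text recommends can be sketched as below; the 0.5 warning ratio is an illustrative threshold, not a standard:

```python
def queue_health(queue_depth: int, capacity: int, drain_rate: float,
                 warn_ratio: float = 0.5):
    """Estimate queueing delay and flag a capacity mismatch early."""
    est_delay_s = queue_depth / drain_rate   # delay for the newest request
    worst_case_s = capacity / drain_rate     # delay when the bucket is full
    # Sustained depth above warn_ratio * capacity is an early sign that
    # arrivals are outpacing the drain rate.
    mismatch = queue_depth >= warn_ratio * capacity
    return est_delay_s, worst_case_s, mismatch
```

For example, a bucket with capacity 100 draining at 100 requests per second has a one-second worst-case wait, and sustained depth above 50 suggests the drain rate no longer matches the arrival rate.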