Advanced Patterns: Queuing, Retry Handling, and Cold Start Mitigation
Beyond Binary Decisions
Basic token buckets give binary decisions: a request either passes or gets a 429. Production systems need more nuance. Sometimes you want to queue requests briefly, tell clients exactly when to retry, or handle cold starts gracefully. These patterns transform token buckets from simple gates into sophisticated traffic management tools.
Queuing Instead of Immediate Rejection
When tokens run out, instead of an immediate 429, queue the request and wait for tokens. A small queue (100 to 1,000 items) absorbs micro-bursts without losing requests. The danger: queues can grow unbounded. Set a maximum wait time: 500ms for user-facing APIs, 5 seconds for batch jobs. Reject if the wait would exceed this limit. For interactive APIs, prefer an immediate 429 with retry guidance over unpredictable queue delays.
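A minimal sketch of this bounded-wait pattern, assuming a single-process bucket with a float token count (the class and parameter names here are illustrative, not from the original):

```python
import threading
import time

class QueuingTokenBucket:
    """Token bucket that briefly waits for a token instead of rejecting outright."""

    def __init__(self, rate: float, capacity: float, max_wait: float = 0.5):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # burst size
        self.max_wait = max_wait    # e.g. 0.5s for user-facing, 5s for batch
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self) -> bool:
        """Return True if a token was obtained within max_wait, else False (send 429)."""
        deadline = time.monotonic() + self.max_wait
        while True:
            with self.lock:
                self._refill()
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
                # Time until one full token accrues at the current refill rate.
                wait = (1 - self.tokens) / self.rate
            if time.monotonic() + wait > deadline:
                return False  # waiting would exceed the limit: reject immediately
            time.sleep(wait)
```

Note that the rejection check happens *before* sleeping, so a request that could never be served in time gets its 429 immediately rather than after a pointless wait.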
Helping Clients Retry Intelligently
When rejecting with 429, include a Retry-After header telling clients when tokens will be available. If your bucket refills in 1 second, respond with Retry-After: 1. This reduces wasted retry traffic by 5× to 10×. Clients should implement exponential backoff with jitter: start at 100 to 500ms, double on each retry, add random jitter. Without jitter, all rejected clients retry simultaneously, creating another rejection wave. Adding randomness spreads retries and reduces storms by 10×.
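The client side of this can be sketched as a small delay calculator, assuming "full jitter" (a random delay drawn from the whole backoff window) and treating a server-supplied Retry-After as a floor; the function name and defaults are illustrative:

```python
import random
from typing import Optional

def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 0.1, cap: float = 30.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    Exponential backoff with full jitter: the window doubles each attempt
    (base, 2*base, 4*base, ...) up to `cap`, and the actual delay is drawn
    uniformly from [0, window) so simultaneous clients spread out instead
    of retrying in a synchronized wave.
    """
    window = min(cap, base * (2 ** attempt))
    delay = random.uniform(0, window)
    if retry_after is not None:
        # Honor the server's Retry-After header: never retry earlier than it says.
        delay = max(delay, retry_after)
    return delay
```

Without the `random.uniform` step, every rejected client would sleep the same deterministic interval and hit the server again in lockstep, recreating the rejection wave the backoff was meant to prevent.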
Cold Start Problems and Solutions
When a service restarts, its bucket might start full with b tokens available. If 10 services restart simultaneously during deployment with b = 1,000, you see a 10,000 request spike hitting cold downstream caches and databases. Solution: start with empty buckets and let them fill naturally, or prefill to 25% capacity and ramp up over 10 to 30 seconds. This prevents coordinated bursts during rolling deployments.
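One way to sketch the prefill-and-ramp approach is a bucket whose effective capacity grows linearly from a warm fraction to full size over the warmup window (class and parameter names are assumptions for illustration):

```python
import time

class WarmupTokenBucket:
    """Token bucket that starts partially filled and ramps capacity after restart.

    The bucket is prefilled to warm_fraction * capacity, and its effective
    capacity grows linearly to the full value over warmup_seconds, so ten
    freshly restarted instances cannot emit ten full bursts at once.
    """

    def __init__(self, rate: float, capacity: float,
                 warm_fraction: float = 0.25, warmup_seconds: float = 30.0):
        self.rate = rate
        self.full_capacity = capacity
        self.warm_fraction = warm_fraction
        self.warmup_seconds = warmup_seconds
        self.start = time.monotonic()
        self.tokens = capacity * warm_fraction  # prefill instead of starting full
        self.last = self.start

    def _current_capacity(self) -> float:
        elapsed = time.monotonic() - self.start
        frac = min(1.0, self.warm_fraction +
                   (1 - self.warm_fraction) * (elapsed / self.warmup_seconds))
        return self.full_capacity * frac

    def allow(self) -> bool:
        now = time.monotonic()
        cap = self._current_capacity()
        self.tokens = min(cap, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With b = 1,000 and a 25% prefill, ten restarting instances can burst at most 2,500 requests immediately instead of 10,000, and the remaining headroom arrives gradually as downstream caches warm.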
Graceful Degradation Under Load
When approaching rate limits, consider degraded responses instead of hard rejection. At 80% capacity, serve cached or simplified responses. At 100%, reject new requests but honor in-flight ones. This keeps partial functionality available during spikes rather than failing completely. A search API might return cached results at 80%, truncated results at 90%, and reject new searches at 100% while completing those already in progress.
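The search-API tiers above can be sketched as a simple utilization-to-mode mapping (the enum names and exact thresholds mirror the example in the text, but are otherwise illustrative):

```python
from enum import Enum

class ServiceMode(Enum):
    FULL = "full"            # normal responses
    CACHED = "cached"        # serve cached results
    TRUNCATED = "truncated"  # serve truncated results
    REJECT = "reject"        # 429 for new requests; finish in-flight ones

def degradation_mode(used: float, limit: float) -> ServiceMode:
    """Pick a response strategy from current utilization of the rate limit."""
    utilization = used / limit
    if utilization < 0.8:
        return ServiceMode.FULL
    if utilization < 0.9:
        return ServiceMode.CACHED
    if utilization < 1.0:
        return ServiceMode.TRUNCATED
    return ServiceMode.REJECT
```

Keeping the thresholds in one pure function makes the degradation policy easy to test and to tune per endpoint, independently of the bucket implementation that tracks `used` and `limit`.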