
Advanced Patterns: Queuing, Retry Handling, and Cold Start Mitigation

Beyond Binary Decisions

Basic token buckets give binary decisions: request passes or gets 429. Production systems need more nuance. Sometimes you want to queue requests briefly, tell clients exactly when to retry, or handle cold starts gracefully. These patterns transform token buckets from simple gates into sophisticated traffic management tools.

Queuing Instead of Immediate Rejection

When tokens run out, instead of an immediate 429, queue the request and wait for tokens to accrue. A small queue (100 to 1,000 items) absorbs micro-bursts without losing requests. The danger: queues can grow unbounded. Set a maximum wait time: 500 ms for user-facing APIs, 5 seconds for batch jobs, and reject any request whose wait would exceed that limit. For interactive APIs, prefer an immediate 429 with retry guidance over unpredictable queue delays.
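The bounded-wait idea can be sketched as a small extension of a standard token bucket. This is a minimal single-threaded illustration, not a production implementation; the class name and `max_wait` parameter are assumptions for the example:

```python
import time

class QueuedTokenBucket:
    """Token bucket that briefly waits for tokens instead of rejecting outright.
    Hypothetical sketch: rate in tokens/sec, capacity b, bounded max_wait."""

    def __init__(self, rate: float, capacity: float, max_wait: float = 0.5):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.max_wait = max_wait          # e.g. 0.5 s user-facing, 5 s batch
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self, cost: float = 1.0) -> bool:
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        # Not enough tokens: compute how long until enough accrue.
        wait = (cost - self.tokens) / self.rate
        if wait > self.max_wait:          # would blow the latency budget
            return False                  # caller sends 429 (+ Retry-After)
        time.sleep(wait)                  # queue briefly instead of rejecting
        self._refill()
        self.tokens -= cost
        return True
```

A request that would wait longer than `max_wait` is shed immediately, which is what keeps the queue's latency contribution bounded.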

Helping Clients Retry Intelligently

When rejecting with a 429, include a Retry-After header telling clients when tokens will be available. If your bucket refills in 1 second, respond with Retry-After: 1. This reduces wasted retry traffic by 5× to 10×. Clients should implement exponential backoff with jitter: start at 100 to 500 ms, double on each retry, and add random jitter. Without jitter, all rejected clients retry simultaneously, creating another rejection wave; randomizing the delay spreads retries out and can cut retry storms by 10×.
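Both halves of this pattern fit in a few lines. The helper names below are hypothetical; the server side computes the header value from the token deficit, and the client side implements full-jitter backoff (random delay in the whole interval, as discussed above):

```python
import math
import random

def retry_after_seconds(cost: float, tokens_available: float,
                        rate: float) -> int:
    """Server side: whole seconds until the bucket holds enough tokens.
    Hypothetical helper producing the value for a Retry-After header."""
    deficit = max(0.0, cost - tokens_available)
    return math.ceil(deficit / rate)

def backoff_with_full_jitter(attempt: int, base: float = 0.2,
                             cap: float = 5.0) -> float:
    """Client side: full-jitter exponential backoff after each 429.
    Returns a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

With an empty bucket and r = 1,000 tokens/sec, `retry_after_seconds(1000, 0, 1000)` yields 1, matching the Retry-After: 1 example.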

Cold Start Problems and Solutions

When a service restarts, its bucket might start full with b tokens available. If 10 services restart simultaneously during a deployment with b = 1,000, you see a 10,000-request spike hitting cold downstream caches and databases. Solution: start with empty buckets and let them fill naturally, or prefill to 25% of capacity and ramp up over 10 to 30 seconds. This prevents coordinated bursts during rolling deployments.
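The prefill-and-ramp variant can be sketched by making the effective capacity grow linearly during a warmup window. The class name and linear ramp are assumptions for illustration; real systems might ramp in steps or per deployment phase:

```python
import time

class WarmupTokenBucket:
    """Bucket that starts at a fraction of capacity and ramps to full b over
    warmup_secs. Hypothetical sketch of the 25%-prefill approach."""

    def __init__(self, rate: float, capacity: float,
                 prefill_fraction: float = 0.25, warmup_secs: float = 30.0):
        self.rate = rate
        self.full_capacity = capacity
        self.prefill_fraction = prefill_fraction
        self.warmup_secs = warmup_secs
        self.started = time.monotonic()
        self.last = self.started
        self.tokens = capacity * prefill_fraction   # prefill, not full b

    def _effective_capacity(self) -> float:
        # Capacity grows linearly from prefill_fraction*b to b over warmup.
        elapsed = time.monotonic() - self.started
        frac = min(1.0, self.prefill_fraction
                   + (1.0 - self.prefill_fraction) * elapsed / self.warmup_secs)
        return self.full_capacity * frac

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        cap = self._effective_capacity()
        self.tokens = min(cap, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Right after restart, a burst can consume at most 0.25 × b tokens instead of the full b, so ten simultaneously restarting services produce a quarter of the spike.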

Graceful Degradation Under Load

When approaching rate limits, consider degraded responses instead of hard rejection. At 80% capacity, serve cached or simplified responses. At 100%, reject new requests but honor in-flight ones. This keeps partial functionality available during spikes rather than failing completely. A search API might return cached results at 80%, truncated results at 90%, and reject new searches at 100% while completing those already in progress.
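The tiered policy from the search-API example reduces to a small utilization-to-strategy mapping. The function name and tier labels are hypothetical; the thresholds are the ones given above:

```python
def degradation_tier(tokens_used: float, capacity: float) -> str:
    """Map bucket utilization to a response strategy (illustrative sketch
    using the thresholds from the search-API example)."""
    utilization = tokens_used / capacity
    if utilization >= 1.0:
        return "reject"       # 429 for new requests; in-flight work completes
    if utilization >= 0.9:
        return "truncated"    # partial/truncated results
    if utilization >= 0.8:
        return "cached"       # serve from cache
    return "full"             # normal responses
```

The request handler checks the tier before doing expensive work, so the degradation decision costs a single division even under load.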

💡 Key Takeaways
- Small queues (100 to 1,000 items) behind token checks smooth micro-bursts for batch workloads. Monitor queue wait time; shed requests exceeding 500 milliseconds for user-facing APIs or 5 seconds for batch to prevent unbounded latency growth.
- Server-side Retry-After headers signal token availability timing. With r = 1,000/sec and the bucket empty, return Retry-After: 1 to hint clients to wait 1 second. This reduces wasted retry traffic by 5 to 10× compared to blind retries.
- Client exponential backoff with full jitter prevents thundering herds. Start at 100 to 500 milliseconds, double on each 429, and add random jitter in the [0, backoff] range. Stripe merchants using this pattern see 429 rates drop by 10× and retry p99 under 50 milliseconds.
- Cold-start prefilling to full b can overwhelm downstream services during restarts. Start empty or at 0.25× b and ramp over 10 to 30 seconds. Kubernetes components ramp quota during rolling updates, avoiding API server request spikes that cause 5 to 10 second latency blips.
- Fractional token accounting prevents free riding. For non-integer costs, use fixed-point math (e.g., millitoken units) and allow fractional accumulation. Cap negative debt to bound recovery time; with r = 1,000/sec, limit debt to 1,000 millitokens (1 token, 1 millisecond recovery).
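The fixed-point accounting in the last takeaway can be sketched by storing the balance as integer millitokens. The class name and debt cap are assumptions for the example; the point is that fractional costs like 0.4 tokens are charged exactly, with no rounding drift:

```python
class MillitokenBucket:
    """Fixed-point bucket: balance kept as integer millitokens so fractional
    request costs are charged exactly. Hypothetical sketch; negative debt is
    capped so recovery time stays bounded."""

    SCALE = 1000  # 1 token = 1,000 millitokens

    def __init__(self, rate_tokens_per_sec: int, capacity_tokens: int,
                 max_debt_milli: int = 1000):
        self.rate_milli = rate_tokens_per_sec * self.SCALE   # millitokens/sec
        self.capacity_milli = capacity_tokens * self.SCALE
        self.balance_milli = self.capacity_milli
        self.max_debt_milli = max_debt_milli                 # debt cap

    def try_charge(self, cost_tokens: float) -> bool:
        cost_milli = round(cost_tokens * self.SCALE)         # exact integer cost
        if self.balance_milli - cost_milli < -self.max_debt_milli:
            return False                                     # would exceed debt cap
        self.balance_milli -= cost_milli
        return True

    def refill(self, elapsed_sec: float) -> None:
        self.balance_milli = min(
            self.capacity_milli,
            self.balance_milli + int(elapsed_sec * self.rate_milli))
```

With r = 1,000 tokens/sec the refill rate is 1,000,000 millitokens/sec, so a 1,000-millitoken debt is repaid in 1 millisecond, matching the recovery bound stated above.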
📌 Interview Tips
1. AWS Lambda queues requests briefly during cold start (typically under 1 second), then sheds if initialization exceeds 3 to 5 seconds. This balances throughput (not losing bursty traffic) and latency (avoiding 10+ second waits).
2. A Salesforce integration starts its token bucket at b = 0 on process restart, ramping to full b = 1,000 over 30 seconds. This prevents an initial 1,000-request spike into Salesforce concurrency limits, which would trigger 5xx errors and expensive retries.
3. An API client implements exponential backoff: the first 429 waits a random [0, 200ms], the second [0, 400ms], the third [0, 800ms], capped at [0, 5000ms]. Telemetry shows 95% of retries succeed within 1 second, and the aggregate 429 rate drops from 15% to under 2%.