Failure Modes: Race Conditions, Clock Issues, and Thundering Herds
Why Token Buckets Fail
The algorithm is simple, but implementation details create subtle bugs that only surface under production load. These failures can let through 2 to 10x your intended rate or artificially deny legitimate traffic. Here are the four most common ways token buckets break in production.
Race Conditions
Two requests check the bucket simultaneously: both read 10 tokens, both decide to proceed, and both write back 9. You admitted 2 requests but spent only 1 token. Under high concurrency, this "double spending" can let through 2 to 10x your intended rate. The fix: make check-and-decrement atomic. In code, use compare-and-swap or atomic counters. In Redis, wrap the check and decrement in a Lua script so it executes as one indivisible operation. Never do "read tokens, then decide, then write tokens" as separate steps. The Lua script pattern (note that GET returns a string, or false for a missing key, so it must go through tonumber): local tokens = tonumber(redis.call('GET', KEYS[1])) or 0; if tokens > 0 then redis.call('DECR', KEYS[1]); return 1 end; return 0
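The race disappears once the read-check-write sequence runs as a single critical section. A minimal Python sketch of the idea (names are illustrative; a threading.Lock stands in for compare-and-swap, which Python does not expose directly):

```python
import threading

class AtomicBucket:
    """Token bucket whose check-and-decrement is one indivisible step."""

    def __init__(self, tokens: int):
        self._tokens = tokens
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Read, decide, and write under one lock: no other thread can
        # interleave between the check and the decrement.
        with self._lock:
            if self._tokens > 0:
                self._tokens -= 1
                return True
            return False

bucket = AtomicBucket(tokens=10)
admitted = []

def worker():
    if bucket.try_acquire():
        admitted.append(1)

threads = [threading.Thread(target=worker) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(admitted))  # 10: exactly the token count, never more
```

Without the lock, two workers could both pass the `> 0` check before either decrements, which is precisely the double-spend described above.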
Clock Jumps
Token buckets refill based on elapsed time. If you use wall-clock time and the system clock jumps forward by 1 second (an NTP correction), you just minted 1,000 free tokens instantly (at r = 1,000/sec). A backward jump freezes refills, creating artificial denials. The fix: always use a monotonic clock, which only moves forward, at a steady rate, regardless of NTP adjustments. In code, that means CLOCK_MONOTONIC on Linux or System.nanoTime() in Java, never wall-clock time.
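In Python the same rule means time.monotonic() rather than time.time(). A minimal sketch of a refill loop pinned to the monotonic clock (class and field names are illustrative):

```python
import time

class MonotonicBucket:
    """Refills from time.monotonic(), which NTP corrections cannot move."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens per second (r)
        self.capacity = capacity      # burst size (b)
        self.tokens = capacity
        self.last = time.monotonic()  # never time.time(): wall clock can jump

    def try_acquire(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last     # always >= 0 with a monotonic clock
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = MonotonicBucket(rate=1000.0, capacity=1000.0)
print(bucket.try_acquire())  # True: the bucket starts full
```

With time.time() instead, a 1-second NTP step forward would make `elapsed` jump and mint r extra tokens; a step backward would make it negative and stall refills entirely.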
Thundering Herd at Refill
If you refill tokens in discrete ticks (dump 1,000 tokens every second at :00), all waiting requests fire simultaneously when the tick hits. Downstream sees a 1,000-request spike every second instead of a smooth 1,000/sec flow. The fix: use continuous refill. On each request, calculate tokens = min(b, current + r times (time since last check)), capping at the burst size b. This spreads arrivals naturally. If you must use ticks, add random jitter so not all buckets refill at exactly the same moment.
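When ticks can't be avoided, a small random offset per bucket keeps refills from aligning. A minimal sketch of the jitter calculation (function name and 10% default are illustrative choices, not a standard):

```python
import random

def next_refill_delay(base_interval: float, jitter_frac: float = 0.1) -> float:
    """Interval until this bucket's next tick, spread by +/- jitter_frac.

    With base_interval=1.0 and jitter_frac=0.1, each bucket refills
    every 0.9 to 1.1 seconds, so buckets drift apart over time instead
    of all firing at :00.
    """
    jitter = random.uniform(-jitter_frac, jitter_frac) * base_interval
    return base_interval + jitter

delays = [next_refill_delay(1.0) for _ in range(5)]
print(delays)  # e.g. [1.03, 0.95, 1.08, 0.91, 1.02]
```

Because each bucket draws its own delay, the per-second spike decorrelates across buckets even though each one still refills in discrete steps.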
Burst Size Mistakes
Setting burst capacity too high feels generous but creates problems. If b equals 10,000 with r equals 1,000/sec, a user can dump 10 seconds' worth of traffic instantly, overwhelming the database connection pool. Rule of thumb: set b to 0.5 to 1.0 seconds' worth of r unless you have verified that downstream can absorb larger spikes.
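The rule of thumb reduces to simple arithmetic. A sketch (the helper name is illustrative):

```python
def recommended_burst(rate_per_sec: float, seconds_of_burst: float = 0.5) -> int:
    """Rule of thumb: b = 0.5 to 1.0 seconds' worth of r."""
    return int(rate_per_sec * seconds_of_burst)

r = 1000.0
print(recommended_burst(r))       # 500: half a second of traffic
print(recommended_burst(r, 1.0))  # 1000: a full second, the upper bound
print(int(r * 10))                # 10000: the risky setting from the text
```

Anything above the 1.0-second mark is a deliberate capacity decision, not a default, and should be backed by load tests against the downstream pool.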