Failure Modes: Thundering Herds, Stale Reads, and Durability Risks
Cache systems face several catastrophic failure modes that can cascade into full outages if not anticipated. Thundering herd, also called cache stampede, occurs when many clients simultaneously miss the same hot key (often due to TTL expiry or eviction) and concurrently slam the source database. Symptoms include database queries per second (QPS) spiking from hundreds to tens of thousands, tail latencies exploding from single-digit milliseconds to seconds, and cascading timeouts that propagate upstream. For example, if 10,000 web servers all miss a popular cache key at TTL expiry and each issues a database query simultaneously, the database can be overwhelmed. Mitigations include single-flight or lease mechanisms that ensure only one loader fetches on a miss while others wait (see the sketch below), request coalescing at the cache layer, probabilistic early refresh where entries are refreshed before TTL with a random bias, and per-key rate limits.
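A minimal single-flight sketch in Python; the class, its method names, and the in-process scope are illustrative (in Go, golang.org/x/sync/singleflight provides this pattern out of the box):

```python
import threading

class SingleFlight:
    """Coalesces concurrent loads of the same key so only one
    caller hits the backing store; the rest wait for its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (event, shared result box)

    def load(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        event, box = entry
        if leader:
            try:
                box["value"] = loader(key)  # the only database fetch for this key
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()  # wake all waiting followers
            return box["value"]
        event.wait()  # follower: block until the leader finishes
        return box.get("value")  # None if the leader's load failed
```

Calling `SingleFlight().load("user:42", fetch_from_db)` from thousands of threads results in one database query per key per expiry, not one per thread. This coalesces within a single process; cross-server coalescing needs a shared lease, as in the Meta example later in this section.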
Stale reads and write races plague cache-aside implementations. The naive update-cache-on-write pattern can race with concurrent readers: thread A reads stale data from the database during a slow query, thread B writes new data to both the database and the cache, and then thread A overwrites the cache with its stale result. This leaves the cache serving incorrect data until TTL expiry. The fix is delete-on-write (invalidation): write to the database, then delete the cache key, forcing the next reader to fetch fresh data. Version stamps provide additional safety: tag cached values with a version number and reject writes of older versions. Invalidation loss is another hazard, where the delete operation fails or is delayed due to network partitions, leaving stale cache entries behind. Retryable, idempotent invalidation backed by an outbox pattern or change data capture (CDC) stream guarantees eventual delivery.
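A hedged sketch of the delete-on-write path, including the retry hook for lost invalidations; `db`, `cache`, the exception type, and the outbox list are stand-ins for real clients, not a specific API:

```python
# Hypothetical outbox: invalidations that must eventually be delivered.
invalidation_outbox = []

def write_with_invalidation(db, cache, key, value):
    """Cache-aside write path: persist to the database first, then
    delete (never update) the cache key. The next reader misses and
    reloads fresh data, closing the stale-overwrite race window."""
    db.update(key, value)            # 1. source of truth first
    try:
        cache.delete(key)            # 2. invalidate
    except ConnectionError:
        # Invalidation-loss hazard: record the failed delete so a
        # retry worker (or CDC consumer) can replay it idempotently.
        invalidation_outbox.append(key)
```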
Write-back systems face durability and reordering risks. Node loss before a flush to the source results in lost writes unless the write buffer is durable and replicated, much like a database write-ahead log (WAL). Out-of-order flushes can cause older state to overwrite newer state if concurrent updates to the same key flush in the wrong sequence. Mitigations require durable, replicated WALs in the cache tier, idempotent versioned upserts in the source that reject older versions, and strict flush-delay service level agreements (SLAs) bounding the data loss window to 1 to 5 seconds. Cold start after mass eviction or a new node deployment causes load spikes as every request misses; warm critical keys proactively and apply TTL jitter to prevent synchronized expiry.
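A small sketch of the two cold-start mitigations just mentioned, TTL jitter plus proactive warming; the `cache.set(..., ttl=...)` interface, the base TTL, and the hot-key list are assumptions for illustration:

```python
import random

BASE_TTL_SECONDS = 300

def jittered_ttl(base=BASE_TTL_SECONDS, spread=0.15):
    """Randomize the TTL by +/- `spread` (15% here) so entries
    written together do not all expire together and re-stampede
    the source database."""
    return int(base * random.uniform(1 - spread, 1 + spread))

def warm_cache(cache, db, hot_keys):
    """Proactively populate known-hot keys before taking traffic,
    e.g. after a deploy or mass eviction, so the first wave of
    requests hits the cache instead of the database."""
    for key in hot_keys:
        cache.set(key, db.read(key), ttl=jittered_ttl())
```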
💡 Key Takeaways
•Thundering herd causes database QPS to spike from hundreds to tens of thousands when many clients simultaneously miss a hot key at TTL expiry, and tail latencies explode to seconds
•Single-flight or lease mechanisms ensure only one loader fetches on a miss while others wait, preventing duplicate database queries; Meta uses lease tokens across thousands of cache servers
•Stale read race: the update-cache-on-write pattern lets a slow reader overwrite the cache with stale data; the solution is delete-on-write, forcing a fresh fetch on the next read
•Invalidation loss from network failures leaves the cache stale until TTL expiry; use retryable, idempotent invalidation with an outbox or change data capture to guarantee delivery
•Write-back durability requires a replicated WAL in the cache tier, idempotent versioned upserts, and a bounded flush delay (1 to 5 seconds) to limit the data loss window on node failure
•Cold start and synchronized expiry cause load spikes; warm critical keys proactively, apply TTL jitter (plus or minus 10 to 20 percent), and use refresh-ahead for hot sets
📌 Examples
Meta thundering herd mitigation: On a cache miss, acquire a per-key lease token with a TTL. If acquired, load from MySQL and populate the cache. If the lease is already held, wait briefly and then retry the cache get. The lease prevents 10,000 web servers from simultaneously querying the database when a popular key expires.
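A simplified rendering of that flow; `cache.add` is assumed to be an atomic set-if-absent (as in memcached), and the lease here is a plain marker key rather than Meta's full token protocol:

```python
import time

def get_with_lease(cache, db, key, lease_ttl=2.0, retry_delay=0.05):
    """Lease-based miss handling: on a miss, try to acquire a short
    per-key lease. The winner loads from the database and fills the
    cache; losers back off and re-read the cache instead of piling
    onto the database."""
    while True:
        value = cache.get(key)
        if value is not None:
            return value
        lease_key = "lease:" + key
        if cache.add(lease_key, 1, ttl=lease_ttl):  # atomic acquire
            value = db.read(key)                    # single DB fetch
            cache.set(key, value, ttl=300)
            cache.delete(lease_key)
            return value
        time.sleep(retry_delay)  # lease held elsewhere: wait, retry
```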
Stale write prevention with versions: cache.get returns (value, version). On a database write, newVersion = currentVersion + 1; database.update(key, newValue, newVersion); cache.delete(key). On a late cache populate attempt, only accept if incomingVersion > cachedVersion, rejecting stale overwrites.
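The version check as code; note that a production cache would need the read-compare-write to be atomic (e.g. compare-and-set, or a Lua script in Redis), which this sketch glosses over:

```python
def populate_if_newer(cache, key, value, version):
    """Version-stamped populate: only overwrite the cached entry if
    the incoming version is newer, so a late writer carrying stale
    data cannot clobber a fresher entry. Entries are stored as
    (value, version) tuples, matching the prose above."""
    current = cache.get(key)  # (value, version) or None
    if current is None or version > current[1]:
        cache.set(key, (value, version))
        return True
    return False  # stale overwrite rejected
```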
Write-back recovery after crash: On startup, replay WAL entries in sequence order. For each mutation, check the idempotency table in the database: if (lastAppliedVersion >= mutation.version) skip; else apply the mutation and update lastAppliedVersion. This prevents duplicate application and reordering.
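A sketch of that replay loop; the `mutation` fields and the `db` methods are illustrative names for the WAL record, the idempotency-table lookup, and the versioned upsert:

```python
def replay_wal(wal_entries, db):
    """Crash recovery for a write-back cache: replay buffered
    mutations in version order, skipping any the database has
    already applied. Sorting preserves per-key ordering, and the
    version check makes replay idempotent, so older state can
    never overwrite newer state."""
    for mutation in sorted(wal_entries, key=lambda m: m.version):
        if db.last_applied_version(mutation.key) >= mutation.version:
            continue  # duplicate or reordered entry: skip
        # Versioned upsert: applies the value and records
        # mutation.version as the new lastAppliedVersion.
        db.apply(mutation.key, mutation.value, mutation.version)
```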