Caching › Cache Invalidation Strategies · Hard · ⏱️ ~3 min

Failure Modes and Edge Cases: What Breaks Cache Invalidation in Production

Cache invalidation fails in subtle and dangerous ways that cause correctness bugs, outages, and data exposure. The most critical failure mode is lost or delayed invalidation events: the event notifying caches of a data change never arrives, or arrives seconds to minutes late. Symptoms include serving stale data until TTL expiry (which could be hours for long-TTL strategies) and silent correctness bugs such as showing private content after a user changed visibility settings, or displaying sold-out inventory as available. Causes include message broker failures, network partitions between the invalidation producer and its consumers, bugs in event handlers that silently swallow errors, and capacity saturation where invalidation queues back up under load. Mitigations include at-least-once delivery with idempotent handlers so retries are safe, aggressive max-TTL caps (no key lives longer than a few minutes) as a safety net, periodic reconcilers that scan for divergence between cache and origin, and canary comparison reads that sample random keys, comparing cached versus origin values to detect drift.

Out-of-order invalidations are a related but distinct failure: events arrive in the wrong sequence, causing older values to overwrite newer ones in cache after rapid successive updates. For example, a user updates their profile picture twice in quick succession (upload A at time T, upload B at time T + 1 second). The invalidations publish as events, but due to network delays or retry logic, the invalidation for upload B arrives first, followed by the invalidation for upload A. If the cache is empty, the second event (for the older upload A) gets processed, the cache is repopulated with stale data, and the user sees their old picture until TTL expiry despite successfully uploading the new one. This is especially pernicious with at-least-once delivery, where retries shuffle ordering.
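The lost-event mitigations above (idempotent handlers under at-least-once delivery, plus a hard max-TTL cap as a safety net) can be sketched in a few lines. This is a minimal in-memory illustration, not a production cache; the class, the event shape, and the 300-second cap are assumptions for the example:

```python
import time

MAX_TTL_SECONDS = 300  # safety net: no key outlives this, even if an invalidation is lost


class Cache:
    """Minimal in-memory cache with a hard TTL cap (illustrative only)."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl):
        ttl = min(ttl, MAX_TTL_SECONDS)  # clamp every write to the max-TTL cap
        self._store[key] = (value, time.time() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]  # lazily expire on read
            return None
        return value

    def delete(self, key):
        self._store.pop(key, None)


processed_event_ids = set()  # dedupe store: makes redelivered events no-ops


def handle_invalidation(cache, event):
    """Idempotent handler: safe under at-least-once delivery and retries."""
    if event["id"] in processed_event_ids:
        return  # duplicate delivery; skip (deleting again would also be harmless)
    cache.delete(event["key"])
    processed_event_ids.add(event["id"])
```

Because the handler is idempotent, the broker can redeliver the same event any number of times without changing the outcome, and the TTL clamp bounds how long a lost event can leave stale data in place.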
Solutions include versioned writes, where each update carries a monotonically increasing version number or timestamp (with clock-skew tolerance), and cache setters use compare-and-set semantics to reject older versions attempting to overwrite newer cached values. Event partitioning by entity identifier (all events for user:123 go to the same partition) preserves per-key ordering when using ordered message brokers like Kafka.

Thundering herds and delete storms are operational failure modes that cascade into outages. A thundering herd occurs when a popular cache entry expires or is invalidated and thousands of concurrent requests simultaneously hit the origin to fetch the value, overloading the database and spiking latency. Pinterest and Meta both report using single-flight patterns (also called request coalescing or leases), where only one requester fetches from origin while others wait for the result, combined with stale-while-revalidate to serve slightly stale data during refresh. Delete storms happen when updating a single hot object requires invalidating thousands of derived keys (a viral post appearing in millions of feeds), overwhelming the invalidation pipeline and causing it to lag or drop messages. The cache hit rate collapses as entries are deleted, amplifying origin load. Versioned or generational keys prevent this by making invalidations O(1) rather than O(N).

Multi-region partial failures add another dimension: if an invalidation propagates to region A but not region B due to a network partition, users in region B see stale data for seconds to minutes, violating monotonic-read guarantees (a user sees new data in region A, then roams to region B and sees old data). Region-local read-your-writes via session stickiness, and explicit version tokens in requests, are common mitigations.
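The compare-and-set versioned write described above can be sketched as follows. A real deployment would perform the check atomically inside the cache server (for example, via a Lua script in Redis); this in-memory version guarded by a lock is only an illustration, and the class name is an assumption:

```python
import threading


class VersionedCache:
    """Cache setter with compare-and-set on a monotonically increasing version.

    A stale refill (carrying an older version) is rejected instead of being
    allowed to overwrite newer cached data, which neutralizes out-of-order
    invalidation events.
    """

    def __init__(self):
        self._store = {}  # key -> (version, value)
        self._lock = threading.Lock()

    def set_if_newer(self, key, version, value):
        with self._lock:  # stands in for the cache server's atomicity
            current = self._store.get(key)
            if current is not None and current[0] >= version:
                return False  # reject: cache already holds this version or newer
            self._store[key] = (version, value)
            return True

    def get(self, key):
        entry = self._store.get(key)
        return entry[1] if entry else None
```

With this in place, the profile-picture scenario above becomes harmless: even if upload A's event is processed after upload B's, its lower version number cannot displace the newer value.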
💡 Key Takeaways
- Lost or delayed invalidation events cause silent correctness bugs (private content shown, sold-out inventory available) until TTL expiry; mitigate with at-least-once delivery, idempotent handlers, max-TTL caps (minutes, not hours), and periodic reconcilers scanning for cache-versus-origin divergence
- Out-of-order invalidations from retries or network delays cause older updates to overwrite newer cached values; prevent with versioned writes (monotonic version numbers), compare-and-set cache semantics rejecting older versions, and event partitioning by entity identifier to preserve per-key ordering
- Thundering herd on hot-key expiry sends thousands of concurrent origin requests, spiking latency and overloading databases; mitigate with single-flight patterns (one fetcher, others wait), stale-while-revalidate serving slightly old data during refresh, and per-key jitter on TTLs
- Delete storms from high-fan-out invalidations (updating a viral post invalidates millions of feed keys) overwhelm invalidation infrastructure and collapse cache hit rates; prevent with versioned or generational keys transforming O(N) deletes into O(1) version bumps
- Multi-region partial failures from network partitions propagate invalidations to some regions but not others for seconds to minutes, violating monotonic reads when users roam between regions; mitigate with region-local read-your-writes via session stickiness and version tokens
- Negative cache poisoning caches not-found results, masking newly created data; use very short TTLs for negative entries (seconds), invalidate on create events, or use version epochs that invalidate on object creation to prevent indefinite hiding
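The versioned/generational-key idea from the takeaways can be sketched as a minimal in-memory example. Instead of deleting N derived keys, invalidation bumps one per-entity generation counter, so every old derived key simply stops being read (in a real cache, the orphaned entries would then age out via TTL); the class and key layout here are illustrative assumptions:

```python
class GenerationalCache:
    """Derived keys embed a per-entity generation; invalidation is an O(1) bump."""

    def __init__(self):
        self._data = {}        # full derived key -> value
        self._generation = {}  # entity -> current generation number

    def _key(self, entity, suffix):
        # Current generation is baked into every derived key.
        gen = self._generation.get(entity, 0)
        return f"{entity}:gen{gen}:{suffix}"

    def put(self, entity, suffix, value):
        self._data[self._key(entity, suffix)] = value

    def get(self, entity, suffix):
        return self._data.get(self._key(entity, suffix))

    def invalidate(self, entity):
        # One counter bump orphans every derived key for this entity at once;
        # no per-key delete storm, regardless of fan-out.
        self._generation[entity] = self._generation.get(entity, 0) + 1
```

Updating a viral post then costs one increment rather than millions of deletes, which is exactly the O(N) → O(1) transformation the takeaway describes.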
📌 Examples
- A message broker outage at Meta in 2021 delayed invalidations by 30 seconds, causing some users to briefly see stale feed content; the incident report noted that max-TTL caps (90 seconds for most objects) limited the blast radius, and automatic reconcilers detected the divergence, triggering alerts within 2 minutes
- An e-commerce site experienced out-of-order invalidations when a customer rapidly changed their shipping address twice: retry logic caused the first update's invalidation to arrive after the second, repopulating the cache with the old address; the fix deployed versioned address writes with compare-and-set rejecting version 1 overwriting version 2
- Pinterest reported a thundering herd incident where a popular pin's cache entry expired during peak traffic, sending 50,000 concurrent requests to origin databases, spiking database CPU to 95% and p99 latency to 5 seconds; the resolution added single-flight coalescing, limiting origin hits to 1 per key with a 10-second lease window
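The single-flight coalescing pattern in the Pinterest example can be sketched with a shared in-flight table: one leader fetches from origin while concurrent callers for the same key wait and reuse its result. The names are illustrative, and this sketch omits the lease-timeout and error handling a production system would need:

```python
import threading


class SingleFlight:
    """At most one in-flight origin fetch per key; other callers share the result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, fetch):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True  # first caller for this key fetches from origin
            else:
                leader = False  # a fetch is already in flight; wait for it
        done, holder = entry
        if leader:
            try:
                holder["value"] = fetch()  # only the leader hits the origin
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()  # release all waiters
        else:
            done.wait()  # block until the leader's result is ready
        return holder["value"]
```

Under a thundering herd, thousands of concurrent callers collapse into a single origin fetch per key, which is what kept the hypothetical 50,000 requests from all reaching the database.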