Failure Modes and Edge Cases: What Breaks Cache Invalidation in Production
Lost or Delayed Invalidation Events
The most critical failure: invalidation events never arrive, or arrive minutes late, because of message broker failures, network partitions, handler bugs that silently swallow errors, or capacity saturation where queues back up. Symptoms include serving stale data until TTL expiry (potentially hours with long-TTL strategies) and silent correctness bugs, such as showing private content after its visibility changed or displaying sold-out inventory as available. Mitigate with at-least-once delivery paired with idempotent handlers (so retries are safe), aggressive maximum TTL caps (minutes, not hours), periodic reconcilers that scan for divergence between cache and origin, and canary comparison reads that sample random keys.
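The TTL cap and the sampling reconciler can be sketched in a few lines. This is a minimal illustration, not a production design: the cache and origin are modeled as plain dictionaries, and the names `MAX_TTL_SECONDS`, `capped_ttl`, and `reconcile_sample` are hypothetical, as is the 300-second cap.

```python
import random

# Assumed upper bound on any TTL (hypothetical value: minutes, not hours).
MAX_TTL_SECONDS = 300


def capped_ttl(requested_ttl_seconds):
    """Clamp a requested TTL so a lost invalidation heals within the cap."""
    return min(requested_ttl_seconds, MAX_TTL_SECONDS)


def reconcile_sample(cache, origin, sample_size, rng=random):
    """Compare a random sample of cached keys against the origin.

    `cache` and `origin` are plain dicts standing in for a real cache and
    database. Divergent entries are evicted so the next read refetches them.
    Returns the list of keys that were found stale.
    """
    keys = list(cache)
    sampled = rng.sample(keys, min(sample_size, len(keys)))
    stale = []
    for key in sampled:
        if cache[key] != origin.get(key):
            stale.append(key)
            del cache[key]  # evict; the next read repopulates from origin
    return stale
```

Run periodically, the number of stale keys found per sample also works as a rough health signal for the invalidation pipeline: a sudden rise suggests events are being lost or delayed.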
Out-of-Order Invalidations
A user updates their profile twice in quick succession (upload A at time T, upload B at T+1s). Because of network delays or retry logic, the invalidation for B arrives before the one for A. If the cache entry is empty at that point (B's event has already cleared it), the late event for A, which carries older data, repopulates the cache with stale data. The user sees the old picture until the TTL expires, despite a successful upload. This is especially pernicious with at-least-once delivery, where retries shuffle ordering. Solutions: versioned writes with monotonically increasing version numbers; compare-and-set cache semantics that reject older versions attempting to overwrite newer ones; and event partitioning by entity ID (all events for user:123 go to the same partition), which preserves per-key order.
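The versioned compare-and-set idea can be sketched as follows. This is an in-memory illustration only: `VersionedCache` and `set_if_newer` are hypothetical names, and a shared cache would need an atomic equivalent of the version check and the write.

```python
class VersionedCache:
    """In-memory sketch of a cache that rejects out-of-order writes."""

    def __init__(self):
        self._entries = {}  # key -> (version, value)

    def set_if_newer(self, key, version, value):
        """Store the value only if `version` is newer than what is cached.

        Returns True if the write was applied, False if it was rejected
        as stale or as a duplicate delivery of the same version.
        """
        current = self._entries.get(key)
        if current is not None and current[0] >= version:
            return False
        self._entries[key] = (version, value)
        return True

    def get(self, key):
        entry = self._entries.get(key)
        return None if entry is None else entry[1]


cache = VersionedCache()
cache.set_if_newer("user:123:avatar", 2, "picture-B")  # newer event arrives first
cache.set_if_newer("user:123:avatar", 1, "picture-A")  # late older event is rejected
```

Because a retried event carries the same version it had the first time, the same check also makes the handler idempotent under at-least-once delivery.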
Thundering Herd and Delete Storms
Thundering herd: a popular entry expires and thousands of concurrent requests hit the origin simultaneously, spiking database CPU (for example, to 95%) and p99 (99th percentile) latency (for example, to 5 seconds). Mitigate with single-flight patterns (one fetcher, others wait for its result), stale-while-revalidate (serving old data during refresh), and per-key jitter on TTL. Delete storms: updating one hot object invalidates thousands of derived keys, overwhelming the invalidation pipeline. The cache hit rate collapses as entries are deleted. Prevent with versioned or generational keys, which transform O(N) deletes into O(1) version bumps.
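A single-flight wrapper might look like the sketch below, assuming a threaded server and an in-process cache. `SingleFlightCache` and its `loader` callback are hypothetical names; TTLs, jitter, and stale-while-revalidate are left out to keep the example short.

```python
import threading


class SingleFlightCache:
    """On a miss, one caller per key loads from origin; the rest wait."""

    def __init__(self, loader):
        self._loader = loader      # function: key -> value (the origin fetch)
        self._lock = threading.Lock()
        self._values = {}          # key -> cached value
        self._inflight = {}        # key -> Event signalled when the load ends

    def get(self, key):
        with self._lock:
            if key in self._values:
                return self._values[key]
            event = self._inflight.get(key)
            leader = event is None
            if leader:
                event = threading.Event()
                self._inflight[key] = event

        if leader:
            try:
                value = self._loader(key)  # the only origin call for this miss
                with self._lock:
                    self._values[key] = value
                return value
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()  # wake followers even if the load raised

        event.wait()
        with self._lock:
            if key in self._values:
                return self._values[key]
        # The leader's load failed, so nothing was cached; try again.
        return self.get(key)
```

With this in place, twenty concurrent requests for the same cold key produce one origin fetch instead of twenty. If the leader's load raises, the waiting callers retry and one of them becomes the new leader.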
Multi-Region Partial Failures
A network partition propagates an invalidation to region A but not to region B for seconds to minutes. Users in B see stale data. Worse, a user who sees new data in A and then roams to B sees old data, which violates monotonic read guarantees. Mitigate with region-local read-your-writes via session stickiness, and with explicit version tokens in requests.
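One way to picture version tokens: the client remembers the highest version it has seen for a key and sends it with each read, and a region serves from its cache only if the cached version is at least that high. The sketch below assumes the cache and origin are dictionaries mapping keys to `(version, value)` pairs; `read_with_version_token` is a hypothetical name.

```python
def read_with_version_token(key, min_version, regional_cache, origin):
    """Serve from the regional cache only if it is fresh enough.

    `min_version` is the token carried by the client: the highest version
    of this key it has already observed. If the regional entry is missing
    or older than the token, fall back to the origin and repair the cache.
    Returns a (version, value) pair; the client keeps the version as its
    next token.
    """
    entry = regional_cache.get(key)
    if entry is not None and entry[0] >= min_version:
        return entry
    entry = origin[key]
    regional_cache[key] = entry
    return entry


origin = {"profile:1": (2, "new bio")}
region_b = {"profile:1": (1, "old bio")}  # the invalidation never reached B

# The client saw version 2 in region A, then roamed to region B.
version, value = read_with_version_token("profile:1", 2, region_b, origin)
```

The cost is an origin read whenever a region lags behind the token, which is the price paid for monotonic reads during the partition.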