Caching › Cache Invalidation Strategies · Hard · ⏱️ ~3 min

Failure Modes and Edge Cases: What Breaks Cache Invalidation in Production

Cache invalidation fails in subtle and dangerous ways that cause correctness bugs, outages, and data exposure. The most critical failure mode is lost or delayed invalidation events: the event notifying caches of a data change never arrives, or arrives seconds to minutes late. Symptoms include serving stale data until TTL expiry (which could be hours for long-TTL strategies) and silent correctness bugs such as showing private content after a user changed visibility settings, or displaying sold-out inventory as available. Causes include message broker failures, network partitions between the invalidation producer and its consumers, bugs in event handlers that silently swallow errors, and capacity saturation where invalidation queues back up under load. Mitigations include at-least-once delivery with idempotent handlers so retries are safe, aggressive max-TTL caps (no key lives longer than a few minutes) as a safety net, periodic reconcilers that scan for divergence between cache and origin, and canary comparison reads that sample random keys, comparing cached versus origin values to detect drift.

Out-of-order invalidations are a related but distinct failure: events arrive in the wrong sequence, causing older values to overwrite newer ones in cache after rapid successive updates. For example, a user updates their profile picture twice in quick succession (upload A at time T, upload B at time T + 1 second). The invalidations publish as events, but due to network delays or retry logic, the invalidation for upload B arrives first, followed by the invalidation for upload A. If the cache is empty, the second event (for the older upload A) gets processed, the cache is repopulated with stale data, and the user sees their old picture until TTL expiry despite successfully uploading the new one. This is especially pernicious with at-least-once delivery, where retries shuffle ordering.
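The lost-event mitigations above (idempotent handlers under at-least-once delivery, plus a hard max-TTL cap as a safety net) can be sketched in a few lines. This is a minimal in-memory illustration, not a production cache; the class, the event shape, and the 300-second cap are assumptions for the example:

```python
import time

MAX_TTL_SECONDS = 300  # safety net: no key outlives this, even if an invalidation is lost


class Cache:
    """Minimal in-memory cache with a hard TTL cap (illustrative only)."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl):
        ttl = min(ttl, MAX_TTL_SECONDS)  # clamp every write to the max-TTL cap
        self._store[key] = (value, time.time() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]  # lazily expire on read
            return None
        return value

    def delete(self, key):
        self._store.pop(key, None)


processed_event_ids = set()  # dedupe store: makes redelivered events no-ops


def handle_invalidation(cache, event):
    """Idempotent handler: safe under at-least-once delivery and retries."""
    if event["id"] in processed_event_ids:
        return  # duplicate delivery; skip (deleting again would also be harmless)
    cache.delete(event["key"])
    processed_event_ids.add(event["id"])
```

Because the handler is idempotent, the broker can redeliver the same event any number of times without changing the outcome, and the TTL clamp bounds how long a lost event can leave stale data in place.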
Solutions include versioned writes, where each update carries a monotonically increasing version number or timestamp (with clock-skew tolerance), and cache setters use compare-and-set semantics to reject older versions attempting to overwrite newer cached values. Event partitioning by entity identifier (all events for user:123 go to the same partition) preserves per-key ordering when using ordered message brokers like Kafka.

Thundering herds and delete storms are operational failure modes that cascade into outages. A thundering herd occurs when a popular cache entry expires or is invalidated and thousands of concurrent requests simultaneously hit the origin to fetch the value, overloading the database and spiking latency. Pinterest and Meta both report using single-flight patterns (also called request coalescing or leases), where only one requester fetches from origin while others wait for the result, combined with stale-while-revalidate to serve slightly stale data during refresh. Delete storms happen when updating a single hot object requires invalidating thousands of derived keys (a viral post appearing in millions of feeds), overwhelming the invalidation pipeline and causing it to lag or drop messages. The cache hit rate collapses as entries are deleted, amplifying origin load. Versioned or generational keys prevent this by making invalidations O(1) rather than O(N).

Multi-region partial failures add another dimension: if an invalidation propagates to region A but not region B due to a network partition, users in region B see stale data for seconds to minutes, violating monotonic-read guarantees (a user sees new data in region A, then roams to region B and sees old data). Region-local read-your-writes via session stickiness, and explicit version tokens in requests, are common mitigations.
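The compare-and-set versioned write described above can be sketched as follows. A real deployment would perform the check atomically inside the cache server (for example, via a Lua script in Redis); this in-memory version guarded by a lock is only an illustration, and the class name is an assumption:

```python
import threading


class VersionedCache:
    """Cache setter with compare-and-set on a monotonically increasing version.

    A stale refill (carrying an older version) is rejected instead of being
    allowed to overwrite newer cached data, which neutralizes out-of-order
    invalidation events.
    """

    def __init__(self):
        self._store = {}  # key -> (version, value)
        self._lock = threading.Lock()

    def set_if_newer(self, key, version, value):
        with self._lock:  # stands in for the cache server's atomicity
            current = self._store.get(key)
            if current is not None and current[0] >= version:
                return False  # reject: cache already holds this version or newer
            self._store[key] = (version, value)
            return True

    def get(self, key):
        entry = self._store.get(key)
        return entry[1] if entry else None
```

With this in place, the profile-picture scenario above becomes harmless: even if upload A's event is processed after upload B's, its lower version number cannot displace the newer value.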
💡 Key Takeaways
- Lost or delayed invalidation events cause silent correctness bugs (private content shown, sold-out inventory available) until TTL expiry; mitigate with at-least-once delivery, idempotent handlers, max-TTL caps (minutes, not hours), and periodic reconcilers scanning for cache-versus-origin divergence
- Out-of-order invalidations from retries or network delays cause older updates to overwrite newer cached values; prevent with versioned writes (monotonic version numbers), compare-and-set cache semantics rejecting older versions, and event partitioning by entity identifier to preserve per-key ordering
- Thundering herd on hot-key expiry sends thousands of concurrent origin requests, spiking latency and overloading databases; mitigate with single-flight patterns (one fetcher, others wait), stale-while-revalidate serving slightly old data during refresh, and per-key jitter on TTLs
- Delete storms from high-fan-out invalidations (updating a viral post invalidates millions of feed keys) overwhelm invalidation infrastructure and collapse cache hit rates; prevent with versioned or generational keys transforming O(N) deletes into O(1) version bumps
- Multi-region partial failures from network partitions propagate invalidations to some regions but not others for seconds to minutes, violating monotonic reads when users roam between regions; mitigate with region-local read-your-writes via session stickiness and version tokens
- Negative cache poisoning caches not-found results, masking newly created data; use very short TTLs for negative entries (seconds), invalidate on create events, or use version epochs that invalidate on object creation to prevent indefinite hiding
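The versioned/generational-key idea from the takeaways can be sketched as a minimal in-memory example. Instead of deleting N derived keys, invalidation bumps one per-entity generation counter, so every old derived key simply stops being read (in a real cache, the orphaned entries would then age out via TTL); the class and key layout here are illustrative assumptions:

```python
class GenerationalCache:
    """Derived keys embed a per-entity generation; invalidation is an O(1) bump."""

    def __init__(self):
        self._data = {}        # full derived key -> value
        self._generation = {}  # entity -> current generation number

    def _key(self, entity, suffix):
        # Current generation is baked into every derived key.
        gen = self._generation.get(entity, 0)
        return f"{entity}:gen{gen}:{suffix}"

    def put(self, entity, suffix, value):
        self._data[self._key(entity, suffix)] = value

    def get(self, entity, suffix):
        return self._data.get(self._key(entity, suffix))

    def invalidate(self, entity):
        # One counter bump orphans every derived key for this entity at once;
        # no per-key delete storm, regardless of fan-out.
        self._generation[entity] = self._generation.get(entity, 0) + 1
```

Updating a viral post then costs one increment rather than millions of deletes, which is exactly the O(N) → O(1) transformation the takeaway describes.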
📌 Examples
- A message broker outage at Meta in 2021 delayed invalidations by 30 seconds, causing some users to briefly see stale feed content; the incident report noted that max-TTL caps (90 seconds for most objects) limited the blast radius, and automatic reconcilers detected the divergence, triggering alerts within 2 minutes
- An e-commerce site experienced out-of-order invalidations when a customer rapidly changed their shipping address twice: retry logic caused the first update's invalidation to arrive after the second, repopulating the cache with the old address; the fix deployed versioned address writes with compare-and-set rejecting version 1 overwriting version 2
- Pinterest reported a thundering herd incident where a popular pin's cache entry expired during peak traffic, sending 50,000 concurrent requests to origin databases, spiking database CPU to 95% and p99 latency to 5 seconds; the resolution added single-flight coalescing, limiting origin hits to 1 per key with a 10-second lease window
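The single-flight coalescing pattern in the Pinterest example can be sketched with a shared in-flight table: one leader fetches from origin while concurrent callers for the same key wait and reuse its result. The names are illustrative, and this sketch omits the lease-timeout and error handling a production system would need:

```python
import threading


class SingleFlight:
    """At most one in-flight origin fetch per key; other callers share the result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, fetch):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True  # first caller for this key fetches from origin
            else:
                leader = False  # a fetch is already in flight; wait for it
        done, holder = entry
        if leader:
            try:
                holder["value"] = fetch()  # only the leader hits the origin
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()  # release all waiters
        else:
            done.wait()  # block until the leader's result is ready
        return holder["value"]
```

Under a thundering herd, thousands of concurrent callers collapse into a single origin fetch per key, which is what kept the hypothetical 50,000 requests from all reaching the database.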