
What Are Cache Stampedes, Poisoned Caches, and Other CDN Failure Modes?

A cache stampede occurs when a popular object that is not yet cached, or was recently purged, receives many concurrent requests, triggering simultaneous origin fetches that can overwhelm the origin server. This thundering herd effect is especially dangerous during traffic spikes on viral content or after global purges. Production systems mitigate stampedes through request collapsing, where the CDN queues concurrent requests for the same object and fetches from origin only once per PoP. Origin shielding adds another layer by designating a regional mid-tier cache that edge PoPs query before reaching the origin, further reducing origin request fan-out. Negative caching helps by briefly caching certain error responses (like 404 Not Found) so that retries for nonexistent content do not repeatedly hit the origin.

Cache poisoning happens when an untrusted request component influences a response that is then stored under a cache key shared with other users. For example, if the origin reflects the X-Forwarded-Host header into generated links or redirects while the cache key ignores that header, an attacker can send one request with a malicious value and have the resulting response cached and served to everyone who requests the same URL. This can enable cross-site scripting, cache deception, or malicious redirects served to legitimate users. Defense requires strictly defining cache keys so that every input that affects the response is either part of the key or removed, normalizing values (lowercasing hostnames, sorting query parameters), and stripping or validating untrusted headers before they reach the origin.

Stale cache issues arise from excessively long TTLs or purge propagation lag. Users see outdated prices, sold-out inventory shown as available, or deprecated application code. Versioned URLs eliminate this for immutable assets, because a new version gets a new URL and never collides with a cached copy of the old one. For dynamic content, short TTLs combined with background revalidation balance freshness and performance.

Anycast and BGP routing anomalies can steer users to suboptimal PoPs, increasing latency. A routing event or poor peering arrangement might send users across the country instead of to a nearby site. Production systems monitor per-PoP latency distributions and use health checks to withdraw BGP routes for unhealthy sites.

DDoS saturation remains a risk even with scrubbing. While large footprints spread volumetric attacks across many sites, a sufficiently large attack, or one concentrated on a small PoP, can saturate the links into that site. Rate limiting earlier in the pipeline and prioritizing critical traffic over best-effort flows help during saturation.

Edge runtime constraints cause their own failure modes: exceeding CPU, memory, or time budgets aborts executions, creating elevated error rates and tail latencies. Keep functions small, deterministic, and defensive.

Finally, data sovereignty issues arise when serving from nearby PoPs inadvertently moves user identity data across regulatory borders. Partition data and restrict routing by geography to maintain compliance.
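Request collapsing, as described above for stampede mitigation, can be sketched in a few lines. This is a minimal illustration and not any particular CDN's implementation: the class name `CollapsingCache`, the `fetch_origin` callable, and the `origin_fetches` counter are all hypothetical names introduced here, and a real edge cache would also handle TTLs, errors, and eviction.

```python
import asyncio


class CollapsingCache:
    """Hypothetical per-PoP cache that collapses concurrent misses per key."""

    def __init__(self, fetch_origin):
        self._fetch_origin = fetch_origin  # assumed async callable: key -> body
        self._cache = {}                   # key -> cached body
        self._in_flight = {}               # key -> Task for the pending origin fetch
        self.origin_fetches = 0            # counter, for illustration only

    async def get(self, key):
        if key in self._cache:
            return self._cache[key]
        task = self._in_flight.get(key)
        if task is None:
            # First miss for this key: start the single origin fetch.
            task = asyncio.create_task(self._fill(key))
            self._in_flight[key] = task
        # Every other concurrent miss waits on the same fetch. shield() keeps
        # one cancelled waiter from cancelling the shared fetch.
        return await asyncio.shield(task)

    async def _fill(self, key):
        try:
            self.origin_fetches += 1
            body = await self._fetch_origin(key)
            self._cache[key] = body
            return body
        finally:
            self._in_flight.pop(key, None)


async def demo(concurrency=1000):
    async def slow_origin(key):
        await asyncio.sleep(0.01)  # simulated origin latency
        return f"body-for-{key}"

    pop = CollapsingCache(slow_origin)
    bodies = await asyncio.gather(
        *(pop.get("/news/breaking") for _ in range(concurrency))
    )
    return pop.origin_fetches, set(bodies)
```

Calling `asyncio.run(demo())` issues many concurrent requests for one key against a single simulated PoP; the counter shows how many of them actually reached the simulated origin.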
💡 Key Takeaways
Cache stampede (thundering herd) overwhelms origins when many concurrent requests fetch uncached popular content; mitigate with request collapsing (one origin fetch per PoP), origin shielding (regional mid-tier cache), and negative caching of errors
Cache poisoning occurs when untrusted headers like X-Forwarded-Host influence a response that is cached under a key shared with other users, allowing attackers to inject malicious cached responses; defense requires strict cache key definition, value normalization, and stripping or validating untrusted headers
Purge propagation lag creates eventual consistency windows (typically seconds) where PoPs serve mixed content versions; monitor per region purge latency and expose SLAs around propagation time
Anycast and BGP anomalies can route users to distant PoPs instead of the nearest one, increasing latency from 10-25 ms to 100+ ms; use health checks and route withdrawal plus multi-region prefixes for smooth failover
Edge runtime constraint violations (exceeding, for example, a 50 ms CPU budget, memory caps, or timeout limits) abort executions and spike tail latencies; keep functions small, deterministic, and within provider budgets
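The cache key hygiene mentioned in the takeaways can be illustrated with a small sketch. It assumes a cache key built only from the method, the lowercased hostname, the path, and the sorted query parameters, plus a separate step that drops untrusted headers before the request is forwarded to origin. The function names and the `UNTRUSTED_HEADERS` set are hypothetical choices for this example, not a specific CDN's configuration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical set of client-supplied headers the origin should never see.
UNTRUSTED_HEADERS = {"x-forwarded-host"}


def cache_key(method, url):
    """Build a normalized key from trusted request components only."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()       # lowercase the hostname
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(params)                   # sorted query parameters
    return f"{method.upper()} {host}{parts.path or '/'}?{query}"


def headers_for_origin(headers):
    """Drop untrusted headers so they cannot influence the cached response."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in UNTRUSTED_HEADERS
    }
```

With this sketch, `cache_key("get", "https://Example.COM/a?page=1&id=7")` and `cache_key("GET", "https://example.com/a?id=7&page=1")` produce the same key, so equivalent requests share one cache entry, and a request carrying `X-Forwarded-Host: evil.com` reaches the origin without that header.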
📌 Examples
A major news site experiences a cache stampede when breaking news purges old articles and a viral tweet drives 50,000 requests per second to the new URL. In the worst case, if each of 300 PoPs sees that rate before the object is cached and there is no request collapsing, every miss goes to origin: 50,000 × 300 = 15 million origin fetches. With collapsing, each PoP makes one origin request and queues the concurrent requests behind it, reducing origin load from 15 million to 300 requests.
An attacker exploits cache poisoning by sending requests with X-Forwarded-Host: evil.com. If the origin uses this header to build URLs and the cache key does not account for it, the CDN caches responses with URLs pointing to evil.com and serves them to legitimate users. The fix: strip X-Forwarded-Host at the edge, or validate it against an allowlist, before the request reaches the origin and before the response is cached.
Microsoft Azure CDN detects a BGP routing event that accidentally steers European users to Asian PoPs, increasing latency from 15 ms to 180 ms. Automated health checks observe elevated latency percentiles, trigger route withdrawal from affected Asian sites, and reroute European traffic to healthy European PoPs within two minutes.
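The arithmetic behind the news site example can be checked with a short sketch. It assumes the worst case in which every PoP sees the full burst of concurrent misses before the object is cached; the function name and parameters are hypothetical, and the `shields` option is only a rough model of the origin shielding described earlier (one collapsed fetch per shield).

```python
def origin_fetches(pops, concurrent_misses_per_pop, collapsing, shields=None):
    """Worst-case origin fetches for one uncached object (illustrative model)."""
    if not collapsing:
        # Every concurrent miss at every PoP goes to origin.
        return pops * concurrent_misses_per_pop
    if shields is not None:
        # PoPs fetch from a shield; each shield collapses to one origin fetch.
        return shields
    # One collapsed origin fetch per PoP.
    return pops


without = origin_fetches(300, 50_000, collapsing=False)
with_collapsing = origin_fetches(300, 50_000, collapsing=True)
print(without, with_collapsing)  # 15000000 300
```

Passing a hypothetical `shields=3` shows how a shield tier would cut the 300 per-PoP fetches further, under the same simplifying assumptions.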