CDN Failure Modes: Fragmentation, Poisoning, and Negative Caching
CDN caching introduces several failure modes that can degrade performance, leak data, or amplify outages if not carefully managed. Cache key fragmentation occurs when too many dimensions are included in the cache key, creating an explosion of cached variants. For example, including the full User-Agent header can generate thousands of variants for a single URL (every browser version, operating system, and device combination). Varying on Accept-Language without normalization can create dozens of variants (en-US, en-GB, en-AU treated as separate entries). The result is a very low hit ratio (often dropping from 90% to below 40%), high memory churn, and frequent evictions. Well-designed systems whitelist only necessary headers, normalize to coarse categories (for example, map all English locales to "en" and all mobile User-Agents to "mobile"), and strip tracking query parameters entirely.
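The sketch below illustrates this kind of normalization as a hypothetical edge function; the header whitelist, device classes, and tracking-parameter list are illustrative assumptions rather than any particular CDN's API:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Illustrative policy: whitelist-only dimensions, coarse normalization.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def normalize_language(accept_language: str) -> str:
    """Collapse locale variants (en-US, en-GB, ...) to a coarse language tag."""
    primary = accept_language.split(",")[0].strip().lower()
    return primary.split("-")[0] or "en"

def normalize_device(user_agent: str) -> str:
    """Map the full User-Agent string to a small set of device classes."""
    ua = user_agent.lower()
    if "mobile" in ua or "android" in ua or "iphone" in ua:
        return "mobile"
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    return "desktop"

def build_cache_key(url: str, headers: dict) -> str:
    """Build a shared cache key from a small, whitelisted set of dimensions."""
    parts = urlsplit(url)
    # Strip tracking parameters and sort the rest for a stable key.
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    )
    lang = normalize_language(headers.get("Accept-Language", "en"))
    device = normalize_device(headers.get("User-Agent", ""))
    return f"{parts.path}?{urlencode(query)}|lang={lang}|device={device}"

# Example: thousands of User-Agent/locale combinations collapse to one variant.
print(build_cache_key(
    "https://example.com/home?utm_source=mail&sort=price",
    {"Accept-Language": "en-GB,en;q=0.9", "User-Agent": "Mozilla/5.0 (iPhone ...)"},
))  # /home?sort=price|lang=en|device=mobile
```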
Cache poisoning and privacy leaks happen when inputs that influence the response are handled inconsistently between the origin and the cache key, or when Vary headers are misconfigured. If a user-controlled header or query parameter changes the response but is not reflected in the cache key (an unkeyed input), a malicious user can poison the cache with attacker-influenced content that is then served to other users. Similarly, omitting Vary: Accept-Encoding can cause a CDN to serve gzip-compressed content to clients that do not support compression, breaking pages. A production incident at a major site involved caching personalized content without varying on a cookie, leaking one user's account data to other users until the cache expired. Mitigations include strict cache key whitelists, never serving personalized or session-scoped responses from a shared cache, always setting appropriate Vary headers, and testing cache behavior with diverse client configurations before rollout.
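As a rough sketch of the mitigation side, a shared-cache admission check might refuse to store anything tied to a session and require an explicit Vary on encoding; the cookie names and the exact policy below are assumptions for illustration, not a specific CDN's behavior:

```python
def cacheable_in_shared_cache(request_headers: dict, response_headers: dict) -> bool:
    """Decide whether a response may be stored in the shared CDN cache."""
    # Never share-cache responses produced for an authenticated session
    # (illustrative cookie names).
    cookies = request_headers.get("Cookie", "")
    if "session_id=" in cookies or "auth_token=" in cookies:
        return False
    # Respect explicit origin instructions.
    cache_control = response_headers.get("Cache-Control", "").lower()
    if "private" in cache_control or "no-store" in cache_control:
        return False
    # Require the origin to declare that it varies on encoding, so compressed
    # bodies are never served to clients that cannot decode them.
    vary = {v.strip().lower() for v in response_headers.get("Vary", "").split(",") if v.strip()}
    return "accept-encoding" in vary

# Example: a personalized response is kept out of the shared cache.
print(cacheable_in_shared_cache(
    {"Cookie": "session_id=abc123"},
    {"Cache-Control": "max-age=300", "Vary": "Accept-Encoding"},
))  # False
```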
Negative caching of errors can lock in transient failures or amplify outages. If a CDN caches HTTP 500 or 404 responses with a long TTL, a brief origin outage or deployment glitch becomes visible to users for the entire cache duration. Conversely, not caching errors at all means the CDN hammers the origin with repeated requests during an outage, worsening the problem. Best practice is to cache errors with a very short TTL (1 to 30 seconds) and enable stale-if-error to serve last-known-good content when the origin is unhealthy. Netflix and other high-availability systems combine a short error TTL, circuit breakers (stop fetching after sustained errors), and fallback to stale content, maintaining availability during origin incidents at the cost of serving slightly outdated data. Range request pathologies also cause issues: arbitrary byte-range requests on large files can bypass the cache if not handled carefully, forcing full-file origin fetches for partial requests, or worse, caching every unique range as a separate object and exploding cache memory.
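The snippet below sketches how a short error TTL, stale-if-error, and a circuit breaker can fit together in a single lookup path; the TTL values, in-memory cache, and breaker threshold are illustrative assumptions, and a real breaker would also half-open after a cooldown rather than staying open forever:

```python
import time

ERROR_TTL = 10          # seconds: negative-cache errors briefly so retries do not hammer the origin
GOOD_TTL = 300          # seconds: normal TTL for successful responses
STALE_IF_ERROR = 3600   # seconds: how long last-known-good content may be served during failures
BREAKER_THRESHOLD = 10  # consecutive origin errors before the CDN stops contacting the origin

cache = {}              # key -> (status, body, stored_at, ttl)
consecutive_errors = 0

def fetch(key, fetch_origin):
    """Serve fresh cache if possible, then the origin, then stale last-known-good content."""
    global consecutive_errors
    now = time.time()
    entry = cache.get(key)

    # Fresh hit (including a briefly negative-cached error): serve it.
    if entry and now - entry[2] < entry[3]:
        return entry[0], entry[1]

    # Only contact the origin while the breaker is closed.
    if consecutive_errors < BREAKER_THRESHOLD:
        try:
            status, body = fetch_origin(key)
        except Exception:
            status, body = 500, b""
        if status < 500:
            consecutive_errors = 0
            cache[key] = (status, body, now, GOOD_TTL)
            return status, body
        consecutive_errors += 1
        # Negative-cache the error with a short TTL, but never overwrite an
        # older good entry that stale-if-error may still need.
        if entry is None:
            cache[key] = (status, body, now, ERROR_TTL)

    # stale-if-error: fall back to last-known-good content if it is recent enough.
    if entry and entry[0] < 400 and now - entry[2] < STALE_IF_ERROR:
        return entry[0], entry[1]
    return 503, b"origin unavailable"
```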
💡 Key Takeaways
• Cache key fragmentation from unbounded Vary dimensions (full User-Agent strings, unnormalized Accept-Language) can drop hit ratios from around 90% to below 40% by creating thousands of variants per URL and causing memory churn
• Cache poisoning occurs when user-controlled inputs affect the response without being validated or reflected in the cache key, allowing attackers to inject content that is served to other users; one cookie-based personalization bug leaked a user's account data to strangers
• Missing Vary: Accept-Encoding can serve gzip-compressed content to clients without compression support, breaking rendering; always vary on encoding and test with diverse client configurations
• Negative caching with a long TTL (for example, caching 500 errors for 300 seconds) locks in transient failures; best practice is a 1 to 30 second error TTL combined with stale-if-error to serve last-known-good content during outages
• Not caching errors at all causes the CDN to repeatedly hammer the origin during incidents, amplifying load; circuit breakers stop fetching after sustained errors (for example, 10 consecutive 500s) to protect the origin
• Range request pathologies can force full-file origin fetches for partial requests or cache every unique byte range separately, exploding memory; mitigate with fixed-size slice caching (for example, 1 MB chunks) and range normalization, as sketched after this list
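A minimal sketch of the slice approach referenced above, assuming a 1 MB slice size (the function name and sizes are illustrative):

```python
SLICE_SIZE = 1024 * 1024  # 1 MB: each aligned slice is cached as its own object

def slices_for_range(start: int, end: int) -> list[tuple[int, int]]:
    """Map an arbitrary client byte range onto fixed, aligned slices.

    Instead of caching every unique client range as a separate object (or
    fetching the whole file for each partial request), the CDN fetches and
    caches aligned slices and assembles the client's range from them; the
    final slice is simply truncated by the origin's Content-Range at EOF.
    """
    first = start // SLICE_SIZE
    last = end // SLICE_SIZE
    return [(i * SLICE_SIZE, (i + 1) * SLICE_SIZE - 1) for i in range(first, last + 1)]

# Example: any range falling within the same slices reuses the same two cache keys.
print(slices_for_range(1_500_000, 2_500_000))
# [(1048576, 2097151), (2097152, 3145727)]
```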
📌 Examples
An e-commerce site included the full User-Agent in its cache key, generating 3,000 variants of the homepage URL across browsers and devices; the hit ratio dropped to 35% and origin load tripled until the key was simplified to a device class (desktop, mobile, tablet)
A media site cached 404 responses with a 600 second TTL during a deployment bug that briefly broke image URLs; users saw missing images for 10 minutes after the fix was deployed, until the cache entries expired
A streaming service implements stale-if-error with a 3600 second stale window and a 10 second error TTL; during a 5 minute database outage returning 500s, users continued streaming cached segments (up to 1 hour old) with zero failed requests