
Failure Modes: Cardinality Explosions and Backpressure Collapse

Log aggregation systems fail in predictable but devastating ways when cardinality or volume assumptions break. Understanding these failure modes is essential for building resilient pipelines that survive production incidents instead of amplifying them.

Cardinality explosions occur when high-cardinality fields such as user_id, request_id, or session_id are promoted to indexed labels. Each unique value consumes memory and CPU in the indexers, so a service logging millions of unique request IDs per hour can exhaust indexer memory within minutes, producing out-of-memory crashes or query timeouts as the index balloons. Symptoms include runaway memory growth, elevated CPU from index maintenance, and eventually complete indexer failure. Mitigation requires label allow and deny lists at the agents, dynamic down-labeling when cardinality exceeds a threshold (say, 10,000 unique values per hour), per-tenant cardinality quotas, and automatic sanitization that hashes or truncates high-entropy fields before indexing.

Backpressure collapse happens during log storms, when a sudden traffic spike saturates the pipeline. Common triggers include crash loops generating stack traces, tight retry loops, and deployment failures affecting hundreds of services simultaneously. Agents buffer to disk until their buffers fill, then either drop logs or block application I/O. Downstream brokers and ingesters saturate, latency spikes, and cascading retries amplify the load. Within minutes the entire system can enter a death spiral: dropped logs break alerting, which causes more incidents, which generate more logs.

The answer is defense in depth. At the edge, implement per-source rate limits (say, 10 MB/s per host) and lossy sampling that drops verbose debug logs under pressure while preserving errors. Agents need bounded on-disk buffers with explicit overflow behavior, and priority channels let critical error logs bypass saturated queues. In the pipeline, admission control and circuit breakers shed load before saturation; for example, if ingester latency exceeds 500 ms, drop 50 percent of low-priority traffic. The key insight is that partial data during an incident is infinitely better than complete system failure and zero visibility.
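To make the agent-side sanitization concrete, here is a minimal Python sketch of label allow-listing, hashing of high-entropy fields, and dynamic down-labeling once a cardinality threshold is crossed. The label names, the 10,000-per-hour limit, and the __overflow__ placeholder are illustrative assumptions, not the API of any particular agent.

```python
# Minimal sketch of agent-side label sanitization, assuming each log record
# carries a dict of labels. Field names, the allow-list, and the threshold
# are illustrative, not taken from a specific agent.
import hashlib
import time
from collections import defaultdict

ALLOWED_LABELS = {"service", "env", "region", "level"}   # indexed as-is
HASHED_LABELS = {"user_id", "session_id"}                # high-entropy: hash before indexing
CARDINALITY_LIMIT = 10_000                               # unique values per label per hour

_seen = defaultdict(set)       # label -> unique values observed this window
_window_start = time.time()

def sanitize_labels(labels: dict) -> dict:
    """Return only safe, bounded-cardinality labels; everything else
    stays in the unindexed log body."""
    global _window_start
    if time.time() - _window_start > 3600:        # reset the hourly window
        _seen.clear()
        _window_start = time.time()

    safe = {}
    for key, value in labels.items():
        if key in HASHED_LABELS:
            # Keep a short hash so values group consistently without
            # putting raw IDs into the index.
            value = hashlib.sha1(str(value).encode()).hexdigest()[:8]
        elif key not in ALLOWED_LABELS:
            continue                               # drop unknown labels entirely

        _seen[key].add(value)
        if len(_seen[key]) > CARDINALITY_LIMIT:
            safe[key] = "__overflow__"             # dynamic down-labeling past the threshold
        else:
            safe[key] = value
    return safe
```

In practice the same guard is usually applied per tenant as well, so one noisy service cannot consume another tenant's cardinality budget.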
💡 Key Takeaways
Cardinality explosion from indexing high-cardinality fields like user_id or request_id produces millions of unique label values per hour, exhausting indexer memory and CPU within minutes and leading to crashes
Mitigation uses label allow/deny lists, dynamic down-labeling when cardinality exceeds thresholds (10,000 unique values per hour), per-tenant quotas, and automatic hashing of high-entropy fields
Backpressure collapse during log storms (crash loops, tight retries) saturates brokers and ingesters, causing agents to drop logs or block application I/O and creating cascading failures
Defense in depth includes per-source rate limits (10 MB/s per host), lossy sampling that drops debug logs under pressure while preserving errors, bounded buffers, and priority channels for critical logs
Admission control and circuit breakers shed load before saturation, for example dropping 50 percent of low-priority traffic when ingester latency exceeds 500 ms to maintain partial visibility (sketched after this list)
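As an illustration of the edge defenses above, the sketch below combines a per-source token bucket, error-preserving lossy sampling, and a latency-based circuit breaker. The class name, thresholds, and severity levels are assumptions chosen to mirror the figures in this section, not a specific agent's implementation.

```python
# Minimal sketch of edge-side load shedding, assuming each record carries a
# byte size and a severity, and that ingester latency is measured elsewhere
# and passed in. The 10 MB/s, 500 ms, and 50% figures mirror the thresholds
# above and are illustrative defaults, not tuned values.
import random
import time

RATE_LIMIT_BYTES = 10 * 1024 * 1024      # per-source budget: 10 MB/s
LATENCY_BREAKER_MS = 500                 # open the breaker past this ingest latency
SHED_FRACTION = 0.5                      # drop half of low-priority traffic when open

class EdgeShedder:
    def __init__(self):
        self.tokens = RATE_LIMIT_BYTES
        self.last_refill = time.time()

    def _refill(self):
        # Token bucket: refill proportionally to elapsed time, capped at the budget.
        now = time.time()
        self.tokens = min(RATE_LIMIT_BYTES,
                          self.tokens + (now - self.last_refill) * RATE_LIMIT_BYTES)
        self.last_refill = now

    def admit(self, record_bytes: int, severity: str, ingester_latency_ms: float) -> bool:
        """Decide whether a record is forwarded or shed at the edge."""
        # Circuit breaker: shed a fraction of low-priority traffic while the
        # ingester is already slow, so error logs keep flowing.
        if ingester_latency_ms > LATENCY_BREAKER_MS and severity in ("debug", "info"):
            if random.random() < SHED_FRACTION:
                return False

        self._refill()
        if self.tokens >= record_bytes:
            self.tokens -= record_bytes          # within the per-source budget
            return True

        # Over budget: lossy sampling that preserves errors and drops verbose logs,
        # acting as a simple priority bypass for critical traffic.
        return severity in ("error", "critical")
```

A real pipeline would feed ingester_latency_ms from a rolling measurement exported by the ingesters rather than a per-call argument, but the shedding decision itself is the same.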
📌 Examples
A microservice logging a unique request_id for every API call at 100K requests per second generates 360 million unique values per hour, causing indexer memory to grow from 50 GB to 200 GB in 30 minutes before crashing (see the back-of-envelope check after these examples)
During a deployment failure affecting 200 services, crash-loop stack traces spike log volume from 5 GB/s to 50 GB/s, saturating Kafka brokers, causing 10-second lag, and forcing agents to drop 80 percent of logs
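A quick back-of-envelope check of the first example; the per-value index overhead is derived from those figures and is an assumption, not a measured constant.

```python
# Sanity-check the arithmetic in the first example above.
requests_per_second = 100_000
unique_per_hour = requests_per_second * 3600      # 360,000,000 unique request_ids per hour
unique_in_30_min = requests_per_second * 1800     # 180,000,000 in the 30-minute window

memory_growth_bytes = 200e9 - 50e9                # indexer memory: 50 GB -> 200 GB
bytes_per_indexed_value = memory_growth_bytes / unique_in_30_min

print(f"{unique_per_hour:,} unique values/hour, "
      f"~{bytes_per_indexed_value:.0f} bytes of index overhead per value")
# -> 360,000,000 unique values/hour, ~833 bytes of index overhead per value
```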