Common GC Failure Modes and How to Prevent Them

Allocation thrash occurs when the application's allocation rate exceeds the GC's reclamation rate, often during traffic spikes. If your service allocates 1 to 2 GB per second (typical for high throughput JSON parsing and string manipulation) without sufficient headroom, the collector can't keep up: you hit back to back collections, allocation stalls, or fall back to full heap stop the world (STW) compaction. Symptoms include GC CPU spiking to 30+ percent, throughput collapsing by 50+ percent, and pauses jumping from 10 ms to multiple seconds. Mitigation requires provisioning headroom (run at 60 to 70 percent occupancy, not 85+ percent), object pooling for high allocation hot paths, or arena allocation for request scoped bursts.

Promotion failure happens in generational collectors when survivors can't be promoted because old space is full or fragmented. This triggers emergency full compaction, which can pause for seconds on large heaps, long enough to violate typical 5 to 30 second RPC timeouts. In distributed systems like Cassandra, Elasticsearch, or Kafka, a single node pausing 8 to 12 seconds gets marked down by heartbeat monitors, triggering shard rebalancing, leader election, and thundering herds that cascade load onto the remaining nodes. Prevention requires tuning tenuring thresholds, increasing old space size, or switching to region based collectors that avoid full heap compaction.

Tail latency amplification in microservices is particularly insidious. Even a 200 to 500 ms GC pause at p99.9 (1 in 1,000 requests) becomes a p99 event when aggregated across 10 service hops, violating SLOs that target single digit millisecond latencies. A user request touching 5 microservices has roughly a 0.5 percent chance of hitting at least one p99.9 pause if each service independently pauses 0.1 percent of the time. This manifests as bimodal latency distributions: most requests complete in 5 to 10 ms, but p99 jumps to 300+ ms.

Native code and safepoint delays are subtle. A stop the world phase can only begin once every Java thread reaches a safepoint, so threads spinning in JIT compiled loops without safepoint polls, or holding JNI critical regions that pin heap arrays, stretch the time to safepoint and elongate STW phases unpredictably. (Threads merely blocked in ordinary native or system calls are already treated as being at a safepoint and don't hold GC up.) A single thread pinning a buffer across a 100 ms native IO call delays the entire collection, turning a 10 ms pause into 110 ms. This shows up as pause time variance uncorrelated with heap pressure: you'll see occasional long pauses even when allocation rate is low.
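To make the pooling mitigation concrete, here is a minimal sketch of a bounded pool of reusable parse buffers for a high allocation hot path. BufferPool, the sizes, and parseInto are illustrative assumptions, not anything from the original text; the point is simply that reusing fixed size buffers keeps a hot path from allocating gigabytes per second.

```java
import java.util.concurrent.ArrayBlockingQueue;

/** Illustrative bounded pool of reusable parse buffers (names and sizes are assumptions). */
final class BufferPool {
    private final ArrayBlockingQueue<byte[]> pool;
    private final int bufferSize;

    BufferPool(int poolCapacity, int bufferSize) {
        this.pool = new ArrayBlockingQueue<>(poolCapacity);
        this.bufferSize = bufferSize;
        for (int i = 0; i < poolCapacity; i++) {
            pool.offer(new byte[bufferSize]);   // pre-allocate once, reuse for the life of the service
        }
    }

    /** Borrow a buffer; fall back to a fresh allocation if the pool is temporarily empty. */
    byte[] acquire() {
        byte[] buf = pool.poll();
        return (buf != null) ? buf : new byte[bufferSize];
    }

    /** Return a buffer; excess or wrong-sized buffers are simply dropped and collected. */
    void release(byte[] buf) {
        if (buf.length == bufferSize) {
            pool.offer(buf);
        }
    }
}

// Usage in a request handler (parseInto is a hypothetical stand-in for your parsing code):
//   byte[] buf = pool.acquire();
//   try { parseInto(buf, request); } finally { pool.release(buf); }
// so the steady-state allocation rate on this path stays near zero.
```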
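For headroom and promotion failure, the usual HotSpot levers are heap sizing, the tenuring threshold, and a region based collector. The command line below is a hedged sketch with illustrative values for a hypothetical 16 GB service (service.jar and every number are assumptions, not recommendations); the flags themselves are standard HotSpot options.

```
# Illustrative HotSpot flags (values are assumptions, tune against your own GC logs):
#   -Xms/-Xmx equal                        : fixed heap, sized so steady state sits near 60-70% occupancy
#   -XX:+UseG1GC                           : region based collector, avoids routine full heap compaction
#   -XX:MaxTenuringThreshold=6             : age survivors longer in young before promotion
#   -XX:InitiatingHeapOccupancyPercent=45  : start concurrent marking early, preserving promotion headroom
#   -XX:G1HeapRegionSize=16m               : larger regions so multi-MB buffers are less likely to be humongous
#   -Xlog:gc*,safepoint:file=gc.log        : unified logging for pauses and time-to-safepoint delays
java -Xms16g -Xmx16g -XX:+UseG1GC \
     -XX:MaxTenuringThreshold=6 \
     -XX:InitiatingHeapOccupancyPercent=45 \
     -XX:G1HeapRegionSize=16m \
     -Xlog:gc*,safepoint:file=gc.log \
     -jar service.jar
```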
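The amplification math above is just compounding of independent tail probabilities. A quick sketch, using the numbers from the text and assuming pauses in different services are independent:

```java
/** Chance that a request crossing n services hits at least one GC pause,
 *  assuming each service pauses independently with probability pPause. */
public final class TailAmplification {
    static double chanceOfAtLeastOnePause(int hops, double pPause) {
        return 1.0 - Math.pow(1.0 - pPause, hops);
    }

    public static void main(String[] args) {
        double p999 = 0.001; // each service's p99.9 pause probability (1 in 1,000)
        System.out.printf("5 hops:  %.2f%%%n", 100 * chanceOfAtLeastOnePause(5, p999));  // ~0.50% -> pause lands near p99.5
        System.out.printf("10 hops: %.2f%%%n", 100 * chanceOfAtLeastOnePause(10, p999)); // ~1.00% -> pause lands at p99
    }
}
```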
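For the safepoint and JNI critical region issue, one Java side mitigation is to hand native code a direct ByteBuffer instead of a heap byte[], so the native side can read it via GetDirectBufferAddress rather than pinning a heap array with GetPrimitiveArrayCritical. This is a hedged sketch of the pattern; compressDirect is a hypothetical native method, not a real library API.

```java
import java.nio.ByteBuffer;

final class NativeCompressor {
    /** Hypothetical native method: the C side reads the input via GetDirectBufferAddress,
     *  so no heap array has to be pinned with GetPrimitiveArrayCritical while it runs. */
    static native int compressDirect(ByteBuffer directInput, int length, ByteBuffer directOutput);

    /** Copy the payload into a caller-supplied direct (off-heap) buffer, then call native code.
     *  Nothing on the Java heap stays pinned during the (possibly 100 ms) native call,
     *  so GC is not left waiting for a critical region to end. */
    static int compress(byte[] payload, ByteBuffer directIn, ByteBuffer directOut) {
        directIn.clear();
        directIn.put(payload, 0, payload.length); // one copy onto off-heap memory
        return compressDirect(directIn, payload.length, directOut);
    }
}
```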
💡 Key Takeaways
Allocation thrash occurs when allocation rate (1 to 2 GB/s typical for high throughput services) exceeds reclaim rate: triggers back to back collections, allocation stalls, GC CPU spikes to 30+ percent, and emergency full GC pauses of multiple seconds; prevent by running at 60 to 70 percent occupancy with 30 to 40 percent headroom
Promotion failure in generational GC when old space full or fragmented forces full heap compaction pausing seconds on large heaps, enough to violate 5 to 30 second RPC timeouts and trigger cascading failures in distributed systems (node marked down, shard rebalancing, thundering herd)
Tail latency amplification: a p99.9 pause of 200 to 500 ms in one microservice becomes a p99 event when aggregated across 10 service hops (roughly 1 percent chance per request if pauses are independent), turning a 5 ms median into a bimodal distribution with 300+ ms at p99
Safepoint delays around native code: threads spinning in JIT compiled loops without safepoint polls, or holding JNI critical regions that pin heap arrays, stretch time to safepoint and elongate STW phases unpredictably; a single thread pinning a buffer across a 100 ms native IO call turns a 10 ms pause into 110 ms, uncorrelated with heap pressure
Large object allocation and pinning resist evacuation, fragmenting the heap and forcing full compaction later; allocating many large buffers (e.g., 10 MB each for IO) creates GC cliffs even if the average allocation rate is modest (see the buffer reuse sketch after this list)
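A minimal sketch of the buffer reuse idea, assuming an IO path that needs a 10 MB scratch buffer per request; IoScratch and the sizing are illustrative assumptions mirroring the takeaway above, with one buffer per worker thread replacing one allocation per request.

```java
final class IoScratch {
    // One 10 MB scratch array per worker thread, allocated once and reused.
    // Allocating a fresh 10 MB buffer per request is typically a humongous allocation in G1
    // (or a straight-to-old allocation in other collectors) and fragments old space over time.
    private static final int SCRATCH_BYTES = 10 * 1024 * 1024;
    private static final ThreadLocal<byte[]> SCRATCH =
            ThreadLocal.withInitial(() -> new byte[SCRATCH_BYTES]);

    static byte[] scratch() {
        return SCRATCH.get(); // same buffer for every request handled on this thread
    }
}
```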
📌 Examples
E-commerce service during a Black Friday traffic spike: allocation rate jumps from 400 MB/s to 2 GB/s with the heap at 85 percent occupancy; young GC can't keep pace, triggering three back to back full GCs with roughly 5 second pauses each, and the request timeout rate spikes from 0.01 percent to 12 percent
Kafka broker with a 24 GB heap experiences promotion failure during log compaction: old space is 88 percent full and fragmented, young survivors can't promote, and the resulting full compaction pauses for 12 seconds, exceeding the 10 second session timeout; the broker is kicked from the cluster, its partitions are rebalanced, and the remaining brokers are overwhelmed, causing 3 more failures
Payment API with 5 microservice hops (auth, fraud, ledger, notification, audit): each service has a p99.9 GC pause of 300 ms, so a user request has roughly a 0.5 percent chance of hitting at least one pause; latency turns bimodal, with an 8 ms median but 300+ ms tails starting around p99.5, blowing through the 50 ms tail latency SLO
Java analytics service calling a native compression library via JNI: compression takes 80 to 120 ms per call and the native code pins the input array in a JNI critical region for the duration, so under steady load one thread is always holding a pinned buffer and GC must wait for the critical section to end; p99 GC pause is 130 ms (10 ms of GC plus up to 120 ms waiting on the pinned region) despite low heap pressure