GC Tuning Strategy: Metrics, Sizing, and Architectural Patterns
Key Metrics to Monitor
GC pause time: Duration of stop-the-world pauses. Track P99, not just the average. A 5 ms average with a 500 ms P99 means roughly one pause in a hundred stalls the application for half a second or more.
GC frequency: How often collections occur. Frequent minor GCs are normal. Frequent major GCs indicate memory pressure. Frequency that increases over time suggests the live data is growing.
Heap utilization: Live data after GC divided by heap size. Under 50% leaves plenty of headroom. Over 70% risks frequent collections. Over 90% risks running out of memory.
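The metrics above reduce to a little arithmetic. The following is a minimal sketch, assuming pause durations have already been collected as an array of milliseconds (for example, parsed from GC logs). The class name `GcMetrics`, the nearest-rank percentile method, and the label strings are illustrative choices, not part of any JVM API. The text does not name the 50% to 70% band, so labelling it `OK` is an assumption.

```java
import java.util.Arrays;

/** Illustrative helpers for pause percentiles and heap utilization (hypothetical names). */
final class GcMetrics {

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) throw new IllegalArgumentException("no samples");
        if (p < 0 || p > 100) throw new IllegalArgumentException("p must be in [0, 100]");
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean of the samples, 0.0 for an empty array. */
    static double average(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0.0);
    }

    /** Live data after a collection divided by heap size. */
    static double utilization(long liveBytesAfterGc, long heapBytes) {
        if (heapBytes <= 0) throw new IllegalArgumentException("heapBytes must be positive");
        return (double) liveBytesAfterGc / heapBytes;
    }

    /** Maps utilization onto the thresholds from the text; "OK" for 50%..70% is an assumption. */
    static String classify(double utilization) {
        if (utilization > 0.90) return "OOM_RISK";
        if (utilization > 0.70) return "FREQUENT_GC_RISK";
        if (utilization < 0.50) return "HEADROOM";
        return "OK";
    }
}
```

With 98 pauses of 2 ms and two pauses of 500 ms, `percentile(pauses, 50)` stays at 2 ms while `percentile(pauses, 99)` is 500 ms, which is the gap an average alone would hide.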
Heap Sizing Strategy
Start with a heap of 2x to 3x the expected live data size. If live data is 2 GB, start with a 4 to 6 GB heap. Too small a heap causes frequent GC. Too large a heap wastes memory and may lengthen pauses for non-concurrent collectors.
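Monitor and adjust. If GC overhead exceeds 5 percent of CPU, increase the heap. If pauses exceed the SLA, consider a different collector. With a non-concurrent collector, reducing the heap can also shorten full collections. With a concurrent collector, marking time generally tracks the amount of live data rather than the heap size, so shrinking the heap mostly removes the headroom the collector needs to finish its cycle before free memory runs out.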
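The 2x to 3x starting rule and the 5 percent overhead threshold can be sketched as follows. This is only an approximation under stated assumptions: it measures overhead as accumulated GC time divided by JVM uptime, which is a wall-clock proxy for CPU share, and for some collectors the reported collection time may include concurrent work rather than pauses alone. The class and method names are hypothetical. The `java.lang.management` calls are standard JDK APIs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Illustrative heap sizing helpers (hypothetical names). */
final class HeapSizing {

    /** Starting heap range from the text: 2x to 3x the expected live data. */
    static long[] initialHeapRangeBytes(long liveBytes) {
        return new long[] { 2 * liveBytes, 3 * liveBytes };
    }

    /** Fraction of elapsed time spent in GC. */
    static double gcOverhead(long gcTimeMs, long elapsedMs) {
        return elapsedMs <= 0 ? 0.0 : (double) gcTimeMs / elapsedMs;
    }

    /** The 5 percent rule from the text. */
    static boolean shouldGrowHeap(double overhead) {
        return overhead > 0.05;
    }

    /** Sum of accumulated collection time across all collector beans of this JVM. */
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) total += t;
        }
        return total;
    }

    /** Overhead since JVM start, using uptime as the elapsed time (an assumption, see above). */
    static double currentOverhead() {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        return gcOverhead(totalGcTimeMs(), uptimeMs);
    }
}
```

A monitoring loop would more likely compare deltas between two samples than totals since startup, so that a long quiet period does not mask recent GC pressure.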
Architectural Patterns
Off-heap storage: Store large data outside the GC-managed heap, in direct byte buffers, memory-mapped files, or native allocations. The GC does not scan or collect this memory, which makes it useful for large caches.
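As one possible illustration of the direct byte buffer option, the sketch below keeps fixed-size `long` slots in memory obtained from `ByteBuffer.allocateDirect`. The class `OffHeapLongStore` and its slot layout are hypothetical. Only the small `ByteBuffer` wrapper object lives on the Java heap, while the backing storage sits in native memory.

```java
import java.nio.ByteBuffer;

/** Fixed-slot store of long values backed by a direct (off-heap) buffer (hypothetical example). */
final class OffHeapLongStore {
    private final ByteBuffer buffer;
    private final int slots;

    OffHeapLongStore(int slots) {
        if (slots <= 0) throw new IllegalArgumentException("slots must be positive");
        this.slots = slots;
        // Backing memory is allocated outside the Java heap and starts zero-filled.
        this.buffer = ByteBuffer.allocateDirect(Math.multiplyExact(slots, Long.BYTES));
    }

    /** Writes a value using absolute indexing, so the buffer position is never moved. */
    void put(int slot, long value) {
        buffer.putLong(offset(slot), value);
    }

    /** Reads the value stored in the given slot. */
    long get(int slot) {
        return buffer.getLong(offset(slot));
    }

    boolean isOffHeap() {
        return buffer.isDirect();
    }

    private int offset(int slot) {
        if (slot < 0 || slot >= slots) throw new IndexOutOfBoundsException("slot " + slot);
        return slot * Long.BYTES;
    }
}
```

This sketch is not thread-safe, and variable-size values would need their own layout and serialization, which is the usual cost of moving a cache off-heap.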
Sharded heaps: Run multiple smaller JVMs instead of one large one. Each has a smaller heap and therefore faster GC. This requires request routing but improves worst-case pause times.
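The request routing this pattern requires can be as simple as hashing a key onto a shard. The sketch below assumes each shard is identified by an address string and that requests carry a string key. Both the `ShardRouter` class and the shard names used with it are hypothetical.

```java
import java.util.List;

/** Routes a request key to one of several JVM shards by hash (hypothetical example). */
final class ShardRouter {
    private final List<String> shardAddresses;

    ShardRouter(List<String> shardAddresses) {
        if (shardAddresses.isEmpty()) throw new IllegalArgumentException("need at least one shard");
        this.shardAddresses = List.copyOf(shardAddresses); // defensive, immutable copy
    }

    /** The same key always maps to the same shard while the shard list is unchanged. */
    String route(String key) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(key.hashCode(), shardAddresses.size());
        return shardAddresses.get(index);
    }
}
```

Plain modulo hashing remaps most keys when the number of shards changes. If shards are added or removed often, a consistent hashing scheme is a common alternative.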
Generational escape: Pre-allocate long-lived objects into pools at startup. They are promoted once and stay in the old generation, which avoids repeated promotion churn for data known to be long-lived.
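A pool of this kind might look like the sketch below, which allocates a fixed number of byte buffers up front and then only hands out and takes back those same instances. The `BufferPool` class is a hypothetical example. Returning `null` when the pool is exhausted, rather than allocating a fresh buffer, is a design assumption made to keep the allocation behaviour predictable.

```java
import java.util.ArrayDeque;

/** Fixed-size pool of reusable byte[] buffers, all allocated at construction (hypothetical example). */
final class BufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int bufferSize;

    BufferPool(int count, int bufferSize) {
        this.bufferSize = bufferSize;
        // All long-lived buffers are created here, once, at startup.
        for (int i = 0; i < count; i++) {
            free.push(new byte[bufferSize]);
        }
    }

    /** Returns a pooled buffer, or null when the pool is exhausted. */
    synchronized byte[] acquire() {
        return free.poll();
    }

    /** Hands a buffer back; the caller must not use it afterwards. */
    synchronized void release(byte[] buffer) {
        if (buffer.length != bufferSize) throw new IllegalArgumentException("foreign buffer");
        free.push(buffer);
    }

    synchronized int available() {
        return free.size();
    }
}
```

Because the buffers are reused, callers are responsible for clearing stale contents, and a buffer released twice would be handed to two owners. A production pool would usually guard against both.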