OS & Systems Fundamentals • Garbage Collection Fundamentals
GC Tuning Strategy: Metrics, Sizing, and Architectural Patterns
Effective GC tuning starts with baseline metrics under representative load. Measure the pause-time distribution (p50/p99/p99.9/max) against your service-level objectives (SLOs): user-facing APIs typically target p99 under 10 to 50 ms, while batch jobs may tolerate seconds. Track GC CPU percentage and mutator utilization (aim for 90 to 95+ percent mutator time under normal load). Monitor allocation rate (MB/s), promotion rate (MB/s into old space), young- versus old-collection frequency, and post-GC heap occupancy trends. A rising post-GC occupancy baseline indicates a memory leak or excessive tenuring.
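As a sketch of how these health numbers fall out of raw counters sampled over a monitoring window (the method names here are illustrative; on the JVM, the cumulative GC time behind them is available via `GarbageCollectorMXBean.getCollectionTime()`):

```java
// Illustrative derivation of GC health metrics from cumulative counters
// sampled over a monitoring window. Class and method names are hypothetical.
public class GcHealth {
    /** Fraction of wall-clock time the application (mutator) ran. */
    static double mutatorUtilization(long gcTimeMs, long windowMs) {
        return 1.0 - (double) gcTimeMs / windowMs;
    }

    /** Promotion rate into the old generation, in MB/s. */
    static double promotionRateMbPerSec(long promotedBytes, long windowMs) {
        return (promotedBytes / 1e6) / (windowMs / 1e3);
    }

    public static void main(String[] args) {
        // 3 s of GC time in a 60 s window -> 0.95 mutator utilization
        System.out.printf("mutator utilization = %.2f%n",
                mutatorUtilization(3_000, 60_000));
        // 600 MB promoted in 60 s -> 10 MB/s promotion rate
        System.out.printf("promotion rate = %.1f MB/s%n",
                promotionRateMbPerSec(600_000_000, 60_000));
    }
}
```

Tracking both over a sliding window is what lets you spot the rising post-GC occupancy baseline the paragraph warns about.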
Heap sizing requires understanding your live set (the reachable object footprint under steady load) and your allocation rate. If the live set is 16 GB and steady allocation is 200 MB/s with spikes to 800 MB/s, a 32 to 48 GB heap gives concurrent collectors enough headroom to operate without promotion failures. Keep steady-state occupancy below 70 to 80 percent for concurrent collectors; running above 85 percent risks emergency full stop-the-world (STW) compaction. For throughput collectors, tighter sizing (10 to 20 percent headroom) is acceptable, since they batch work into planned long pauses.
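The sizing rule above reduces to dividing the live set by a target steady-state occupancy fraction. A minimal sketch under that assumption (names illustrative):

```java
// Sketch: heap sizing from live set and a target steady-state occupancy.
// Concurrent collectors want a low target (~0.5-0.7); throughput
// collectors can run tighter (~0.75-0.85).
public class HeapSizing {
    /** Heap size (GB) so the live set sits at the target occupancy fraction. */
    static double recommendedHeapGb(double liveSetGb, double targetOccupancy) {
        return liveSetGb / targetOccupancy;
    }

    public static void main(String[] args) {
        // Concurrent collector, ~50% target: 16 GB live set -> 32 GB heap
        System.out.println(recommendedHeapGb(16, 0.50));
        // Throughput collector, ~80% target: 16 GB live set -> 20 GB heap
        System.out.println(recommendedHeapGb(16, 0.80));
    }
}
```

Plugging in the paragraph's numbers reproduces its ranges: 16 GB / 0.5 = 32 GB at the low end for a concurrent collector, and 16 GB / 0.8 = 20 GB for a throughput collector.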
Architectural patterns can reduce GC impact more than tuning alone. Request-scoped or batch-scoped arena allocation (as in HHVM, many database engines, and RPC frameworks) turns many GC problems into lifetime problems: allocate from a region and free the entire region at scope end, achieving O(1) reclamation. Off-heap memory for large caches or buffers (using direct ByteBuffers or native allocation) removes them from GC scanning but reintroduces manual memory management risks and potential fragmentation. Object pooling for high-allocation hot paths (e.g., buffer pools, thread-locals for scratch space) reduces allocation pressure but adds complexity and risks leaks if pools aren't bounded.
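A minimal bump-pointer arena in plain Java illustrates the pattern; a production version would hand out typed views (e.g. over direct ByteBuffers) rather than raw offsets, and all names here are illustrative:

```java
// Minimal sketch of a bump-pointer arena over one preallocated byte[].
// Per-request allocations are pointer bumps; reset() reclaims everything
// in O(1) at scope end, so nothing per-request reaches the GC.
final class ByteArena {
    private final byte[] region;   // preallocated backing store
    private int top;               // bump pointer

    ByteArena(int capacity) {
        region = new byte[capacity];
    }

    /** Allocate `size` bytes; returns the offset into the region. */
    int allocate(int size) {
        if (top + size > region.length)
            throw new OutOfMemoryError("arena exhausted");
        int offset = top;
        top += size;               // the entire "allocation" is this bump
        return offset;
    }

    /** O(1) bulk reclamation at the end of a request/batch scope. */
    void reset() {
        top = 0;
    }

    int used() {
        return top;
    }
}
```

Typical usage is one arena per request or batch: allocate freely during the scope, call `reset()` (or discard the arena) when it ends.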
Backpressure and load shedding based on GC health metrics prevent runaway failures. If GC CPU exceeds 20 to 30 percent, or p99 pause time exceeds 2x your SLO budget, reject new requests or degrade non-critical features to stabilize the system. This circuit-breaker pattern prevents cascading failures: a single node entering GC thrash can trigger timeouts, retries, and thundering herds that spread failure fleet-wide. Treat evacuation failures, concurrent-mode failures, and fallback full GCs as P0 signals; even a single occurrence per hour suggests you're operating too close to capacity.
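The circuit-breaker check itself is small; here is a sketch using the thresholds from the text (class and method names are illustrative):

```java
// Sketch of GC-health-based load shedding: shed when GC CPU exceeds the
// configured fraction of wall time, or p99 pause exceeds 2x the SLO
// pause budget. Names and thresholds are illustrative, per the text.
final class GcLoadShedder {
    private final double maxGcCpuFraction;  // e.g. 0.25 (25% of wall time)
    private final double pauseBudgetMs;     // SLO pause budget, e.g. 10 ms

    GcLoadShedder(double maxGcCpuFraction, double pauseBudgetMs) {
        this.maxGcCpuFraction = maxGcCpuFraction;
        this.pauseBudgetMs = pauseBudgetMs;
    }

    /** True if new requests should be rejected (e.g. with HTTP 503). */
    boolean shouldShed(double gcCpuFraction, double p99PauseMs) {
        return gcCpuFraction > maxGcCpuFraction
                || p99PauseMs > 2 * pauseBudgetMs;
    }
}
```

In practice the inputs come from the same windowed GC metrics used for tuning, and shedding a fixed fraction of traffic (rather than all of it) is usually enough to let the collector catch up.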
💡 Key Takeaways
• Baseline metrics are critical for tuning: pause distribution (p50/p99/p99.9/max) against SLOs (e.g., p99 under 10 to 50 ms for APIs), GC CPU percentage and mutator utilization (target 90 to 95+ percent mutator), allocation and promotion rates (MB/s), and post-GC occupancy trends (a rising baseline indicates a leak)
• Heap sizing formula: live set (post-full-GC occupancy) plus headroom for the allocation rate; a 16 GB live set with 200 to 800 MB/s allocation needs a 32 to 48 GB heap for concurrent collectors running at 60 to 70 percent occupancy, or 20 to 24 GB for throughput collectors at 75 to 85 percent
• Arena/region allocation for request- or batch-scoped objects (the HHVM pattern) achieves O(1) bulk reclamation at scope end, eliminating per-request GC overhead and stabilizing tail latency; off-heap memory removes large caches from GC scanning but reintroduces manual memory management risks
• Backpressure and load shedding based on GC health: if GC CPU exceeds 20 to 30 percent or p99 pause exceeds 2x the SLO budget, reject new requests to prevent cascading failures; treat evacuation failures and concurrent-mode failures as P0 signals even at one occurrence per hour
• Collector strategy: batch/offline compute uses throughput collectors that tolerate long pauses for maximum CPU efficiency; latency-sensitive APIs use low-latency concurrent collectors with pause goals and generous headroom; real-time work (sub-100-microsecond) requires non-GC approaches (ownership, RAII, Rust)
📌 Examples
Payments API with a 10 ms p99 SLO: live set measured at 12 GB, allocation 300 MB/s steady with 1 GB/s spikes; provision a 36 GB heap (3x live set) for ZGC, run at 65 percent occupancy, and achieve 6 ms p99 GC pause with 92 percent mutator utilization
Batch ETL job processing 5 TB of data: live set 40 GB (intermediate aggregations), allocation 500 MB/s; use Parallel GC with a 50 GB heap (1.25x live set) and accept 1 to 3 second full-GC pauses every 15 minutes to maximize throughput, completing 12 percent faster than with a concurrent collector
Ad-serving service implements load shedding: when GC CPU exceeds 25 percent or p99 pause exceeds 20 ms (2x the 10 ms budget), return 503 for 10 percent of requests and disable impression logging (a non-critical feature); the system stabilizes without cascading to dependencies
Real-time bidding (RTB) platform requiring p99 under 50 microseconds: abandons the JVM entirely, rewrites the critical path in Rust with zero-copy parsing and bump-pointer allocation into preallocated arenas, achieving 30 microsecond p99 with no GC pauses