Memory Failure Modes: Thrashing, THP Stalls, and OOM
Virtual memory systems fail in predictable ways under stress. Thrashing occurs when the combined working sets of active processes exceed physical RAM. The system spends more time servicing page faults and evicting pages than doing useful work. CPU time shifts to kernel mode (page fault handling, reclaim), major fault rates spike, and I/O queues saturate. Even a 0.1% page fault rate can destroy effective memory access time because major faults are 1,000x to 100,000x slower than DRAM. Recovery requires reducing the number of active processes, increasing RAM, or shrinking working sets.
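To see why, here is a rough back-of-the-envelope calculation in Python; the ~100 ns DRAM latency and ~5 ms major-fault service time are illustrative assumptions, not measurements.

```python
# Illustrative effective-access-time calculation under a major page fault rate.
DRAM_NS = 100               # assumed DRAM access latency (~100 ns)
MAJOR_FAULT_NS = 5_000_000  # assumed major-fault service time (~5 ms of I/O + kernel work)

def effective_access_ns(fault_rate: float) -> float:
    """Average memory access time with a given major-fault probability per access."""
    return (1 - fault_rate) * DRAM_NS + fault_rate * MAJOR_FAULT_NS

for rate in (0.0, 0.0001, 0.001):
    ns = effective_access_ns(rate)
    print(f"fault rate {rate:.2%}: {ns:,.0f} ns (~{ns / DRAM_NS:.0f}x DRAM)")
```

With these assumed numbers, a 0.1% major-fault rate already pushes the average access to roughly 5,100 ns, about 50x slower than hitting DRAM every time.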
Transparent Huge Pages (THP) introduce latency unpredictability. THP automatically promotes 4 KB pages to 2 MB pages by scanning memory and compacting it in the background. Compaction can stall an allocation for 5 to 20 milliseconds. THP also triggers TLB shootdowns (inter-processor interrupts that invalidate TLB entries across all CPUs), causing microsecond-to-millisecond pauses. For latency-sensitive databases, these stalls can push p99 query latency from 2 ms to 25 ms. Meta and Netflix disable THP for MySQL, RocksDB, and other critical services, using explicit huge pages only for carefully controlled regions.
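As a practical check, the current THP mode can be read from sysfs. This is a minimal sketch assuming a modern Linux kernel; the available mode tokens (always, madvise, never) can vary slightly by kernel version.

```python
# Read the kernel's THP settings; the bracketed token is the active mode,
# e.g. "always madvise [never]". Paths are the standard Linux sysfs interface.
from pathlib import Path

def thp_mode(path: str) -> str:
    text = Path(path).read_text().strip()
    # The active mode is wrapped in brackets.
    return text.split("[")[1].split("]")[0] if "[" in text else text

enabled = thp_mode("/sys/kernel/mm/transparent_hugepage/enabled")
defrag = thp_mode("/sys/kernel/mm/transparent_hugepage/defrag")
print(f"THP enabled={enabled}, defrag={defrag}")
if enabled != "never":
    print("Warning: THP active; latency-sensitive services may see compaction stalls.")
```

Operators of latency-sensitive hosts typically set this to never (or madvise, with explicit madvise() calls for controlled regions) at boot.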
Out Of Memory (OOM) situations occur when committed memory exceeds what physical RAM, swap, and reclaimable pages can cover. If many processes dirty their Copy-On-Write pages simultaneously under memory overcommit, the kernel cannot satisfy allocations and invokes the OOM killer, which selects a victim process based on heuristics and terminates it. This is catastrophic for stateful services. The safeguards are strict memory limits (via cgroups or ulimits), 10% to 20% headroom, and monitoring of committed versus physical memory. Google's Borg scheduler enforces hard limits per task, isolating failures. Amazon and Netflix similarly constrain memory and disable swap for latency-critical tiers to prevent both OOM kills and major-fault latency spikes.
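A minimal sketch of the "committed versus physical" monitoring described above, reading the standard Committed_AS, CommitLimit, MemAvailable, and MemTotal fields from /proc/meminfo; the 20% alert threshold is an assumption chosen to mirror the headroom guidance.

```python
# Compare committed memory against the commit limit and check free headroom
# using /proc/meminfo (values are reported in kB).
def meminfo() -> dict:
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])  # kB
    return info

m = meminfo()
commit_ratio = m["Committed_AS"] / m["CommitLimit"]
headroom = m["MemAvailable"] / m["MemTotal"]
print(f"committed/commit-limit: {commit_ratio:.0%}, available headroom: {headroom:.0%}")
if headroom < 0.20:  # assumed threshold, matching the 10-20% guidance above
    print("Warning: low headroom; risk of reclaim stalls or OOM under load.")
```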
💡 Key Takeaways
• Thrashing happens when working sets exceed RAM. Major fault rates spike, CPU time shifts to kernel page-fault handling, and I/O saturates. Even a 0.1% fault rate destroys throughput. Fix: reduce active processes, add RAM, or shrink working sets.
• Transparent Huge Pages (THP) cause unpredictable stalls. Background compaction for 2 MB pages can block allocations for 5 to 20 milliseconds, and TLB shootdowns add microsecond-to-millisecond pauses. The resulting p99 latency spikes violate SLOs.
• Meta and Netflix disable THP for latency-sensitive services (MySQL, RocksDB). They use explicit huge pages only for static, controlled regions like buffer pools to avoid compaction-induced pauses.
• The Out Of Memory (OOM) killer terminates a victim process when committed memory exceeds what physical RAM, swap, and reclaim can satisfy. This is catastrophic for stateful services. Use cgroup limits (a sketch follows this list), maintain 10% to 20% headroom, and monitor committed versus physical memory.
• TLB shootdowns occur when page table entries change. The CPU sends inter-processor interrupts to invalidate TLB entries on all cores, and frequent mapping changes (such as THP promotion or rapid fork/exit) cause pauses.
• Google Borg enforces per-task memory limits below node capacity, isolating OOM kills to individual tasks. Amazon's latency-sensitive services disable swap to prevent major faults from ruining tail latency.
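Below is a minimal sketch of the per-task limit enforcement the takeaways mention, using the cgroup v2 memory controller. It assumes a unified hierarchy mounted at /sys/fs/cgroup and root privileges; the group name "payments" and the 8 GiB limit are hypothetical.

```python
# Create a cgroup-v2 group with a hard memory limit and place a process in it.
import os
from pathlib import Path

cg = Path("/sys/fs/cgroup/payments")   # hypothetical group name
cg.mkdir(exist_ok=True)

limit = 8 * 1024**3                    # hypothetical 8 GiB hard cap
(cg / "memory.max").write_text(str(limit))                # hard limit: OOM kill stays inside the group
(cg / "memory.high").write_text(str(int(limit * 0.9)))    # reclaim/throttle before hitting the hard cap
(cg / "memory.swap.max").write_text("0")                  # disable swap for this latency-critical group

(cg / "cgroup.procs").write_text(str(os.getpid()))        # move the current process into the group

# memory.events exposes an "oom_kill" counter, useful for alerting on per-task kills.
print((cg / "memory.events").read_text())
```

The memory.high watermark triggers reclaim and throttling before the memory.max hard cap, so a misbehaving task degrades in isolation instead of triggering a node-wide OOM.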
📌 Examples
A server runs 10 processes, each with a 5 GB working set, on a machine with 64 GB of RAM. The working sets fit. Adding 5 more processes pushes the combined working sets to 75 GB. The system thrashes: the major fault rate jumps to 1,000 per second, the CPU spends 80% of its time in kernel mode, and throughput drops 90%.
A MySQL instance on a 128 GB server has THP enabled. During peak traffic, THP compaction stalls an allocation for 15 milliseconds. A query waiting on that allocation sees its p99 latency spike from 3 ms to 18 ms, breaking the 10 ms SLO. Meta disables THP and uses explicit 2 MB huge pages for the InnoDB buffer pool, eliminating the stalls.
A microservice under load allocates memory rapidly until its usage exceeds the container's cgroup limit. The OOM killer selects and kills the process, causing 30 seconds of downtime. Kubernetes restarts the pod, but in-flight transactions are lost.