Memory Failure Modes: Thrashing, THP Stalls, and OOM
Thrashing
Thrashing occurs when the working set exceeds physical memory. Pages are constantly swapped in and out, and the system spends more time moving pages than running code. CPU utilization may appear low because processes are blocked waiting on page I/O.
Detection is straightforward: a high major page fault rate combined with heavy disk I/O and low CPU utilization, especially when swap-in and swap-out rates are also high. A server showing 90% I/O wait and 10% CPU busy is likely thrashing. The fix is to reduce memory pressure: add RAM, shrink the working set, or kill memory-heavy processes.
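The detection signals above can be sampled from /proc/vmstat on Linux. The following is a minimal sketch, not a complete monitor: the function names are hypothetical, and it assumes the kernel exposes the pgmajfault, pswpin, and pswpout counters. What counts as a "high" rate depends on the workload and is left to the reader.

```python
import time

def parse_vmstat(text):
    """Parse /proc/vmstat content into a dict of counter name -> int."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def paging_rates(before, after, interval_s):
    """Per-second major-fault and swap rates between two counter samples."""
    keys = ("pgmajfault", "pswpin", "pswpout")
    return {k: (after.get(k, 0) - before.get(k, 0)) / interval_s for k in keys}

def sample_rates(interval_s=1.0, path="/proc/vmstat"):
    """Take two samples interval_s apart and return the paging rates."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return paging_rates(before, after, interval_s)
```

Calling sample_rates() on a Linux host returns the three rates per second. Sustained nonzero values for both pswpin and pswpout, together with low CPU utilization, point toward thrashing rather than ordinary disk load.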
Transparent Huge Page (THP) Stalls
THP automatically promotes 4 KB pages to 2 MB huge pages when possible (the sizes used on x86-64). This sounds helpful but causes problems. A 2 MB huge page needs 512 physically contiguous, aligned 4 KB pages. When free memory is fragmented, which is common under memory pressure, the kernel compacts memory to create contiguous regions, and a process waiting for the huge page stalls while that compaction runs.
The stalls are unpredictable and can last hundreds of milliseconds. Database workloads often disable THP because these latency spikes violate SLAs. Explicit huge pages via hugetlbfs are reserved ahead of time, so they give the huge page benefits without the run-time compaction stalls.
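Before deciding whether to disable THP, it helps to know what mode the host is in. Linux reports it in /sys/kernel/mm/transparent_hugepage/enabled (and the compaction policy in the sibling defrag file), with the active choice shown in square brackets. A small sketch for reading it, with hypothetical function names:

```python
def active_thp_mode(text):
    """Return the bracketed (active) option from a THP sysfs file's content."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None  # no bracketed option found

def read_thp_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Read the active THP mode from sysfs; returns None if THP is unavailable."""
    try:
        with open(path) as f:
            return active_thp_mode(f.read())
    except OSError:
        return None
```

For file content such as "always [madvise] never", active_thp_mode returns "madvise", meaning THP applies only to regions that request it with madvise.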
OOM Killer Behavior
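When memory is exhausted, the kernel must free some. The OOM killer scores each process mainly by how much memory it uses (resident memory, swap, and page tables), then adjusts that score by the process's oom_score_adj value. Older kernels also factored in process age and priority; current kernels do not. The highest-scoring process is killed. This can be your database, while the SSH daemon stays alive.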
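The kernel exposes each process's current score in /proc/<pid>/oom_score, so the likely victims can be listed before an OOM event occurs. A minimal sketch, with hypothetical function names, that ranks processes by that score:

```python
import os

def read_oom_scores(proc_root="/proc"):
    """Map pid -> (oom_score, command name) for every readable process."""
    scores = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited or is unreadable
        scores[int(entry)] = (score, name)
    return scores

def likely_victims(scores, top=5):
    """Return the top entries, highest oom_score first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)
    return ranked[:top]
```

Running likely_victims(read_oom_scores()) on a Linux host lists the processes the OOM killer would currently prefer. If the database is at the top, that is a sign its oom_score_adj needs attention.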
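Control OOM behavior with oom_score_adj, set per process through /proc/<pid>/oom_score_adj. The range is -1000 to 1000. Set critical processes to -1000 to exempt them from the OOM killer entirely. Set noncritical processes to positive values to make them preferred targets. Alternatively, set vm.overcommit_memory to 2 to disable overcommit: the kernel then refuses allocations that would exceed the commit limit (swap plus vm.overcommit_ratio percent of RAM), so they fail up front rather than succeeding and causing a later OOM.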
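Both controls can be sketched in a few lines. The helper names below are hypothetical. The setter writes to /proc/<pid>/oom_score_adj, and lowering a value typically requires root or CAP_SYS_RESOURCE. The commit limit calculation follows the formula in the kernel documentation for vm.overcommit_memory=2 and assumes vm.overcommit_kbytes is left at 0, since a nonzero value overrides the ratio.

```python
import os

def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write an oom_score_adj value (-1000..1000) for the given pid."""
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(str(value))

def commit_limit_kib(mem_total_kib, swap_total_kib, overcommit_ratio,
                     hugetlb_kib=0):
    """Commit limit in strict mode: (RAM - hugetlb) * ratio / 100 + swap."""
    return (mem_total_kib - hugetlb_kib) * overcommit_ratio // 100 + swap_total_kib
```

For example, with 16,000,000 KiB of RAM, 2,000,000 KiB of swap, and a ratio of 50, the commit limit is 10,000,000 KiB. A host with little swap and that ratio can therefore refuse allocations while much of its RAM is still free, so the ratio usually needs tuning alongside the mode change.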