Memory Failure Modes: Thrashing, THP Stalls, and OOM
Virtual memory systems fail in predictable ways under stress. Thrashing occurs when the combined working sets of active processes exceed physical RAM. The system spends more time servicing page faults and evicting pages than doing useful work. CPU time shifts to kernel mode (page fault handling, reclaim), major fault rates spike, and I/O queues saturate. Even a 0.1% page fault rate can destroy effective memory access time because major faults are 1,000x to 100,000x slower than DRAM. Recovery requires reducing the number of active processes, increasing RAM, or shrinking working sets.
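To see why, here is a rough back-of-the-envelope calculation in Python; the ~100 ns DRAM latency and ~5 ms major-fault service time are illustrative assumptions, not measurements.

```python
# Illustrative effective-access-time calculation under a major page fault rate.
DRAM_NS = 100               # assumed DRAM access latency (~100 ns)
MAJOR_FAULT_NS = 5_000_000  # assumed major-fault service time (~5 ms of I/O + kernel work)

def effective_access_ns(fault_rate: float) -> float:
    """Average memory access time with a given major-fault probability per access."""
    return (1 - fault_rate) * DRAM_NS + fault_rate * MAJOR_FAULT_NS

for rate in (0.0, 0.0001, 0.001):
    ns = effective_access_ns(rate)
    print(f"fault rate {rate:.2%}: {ns:,.0f} ns (~{ns / DRAM_NS:.0f}x DRAM)")
```

With these assumed numbers, a 0.1% major-fault rate already pushes the average access to roughly 5,100 ns, about 50x slower than hitting DRAM every time.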
Transparent Huge Pages (THP) introduce latency unpredictability. THP automatically promotes 4 KB pages to 2 MB pages by scanning memory and compacting it in the background. Compaction can stall an allocation for 5 to 20 milliseconds. THP also triggers TLB shootdowns (inter-processor interrupts that invalidate TLB entries across all CPUs), causing microsecond-to-millisecond pauses. For latency-sensitive databases, these stalls can push p99 query latency from 2 ms to 25 ms. Meta and Netflix disable THP for MySQL, RocksDB, and other critical services, using explicit huge pages only for carefully controlled regions.
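As a practical check, the current THP mode can be read from sysfs. This is a minimal sketch assuming a modern Linux kernel; the available mode tokens (always, madvise, never) can vary slightly by kernel version.

```python
# Read the kernel's THP settings; the bracketed token is the active mode,
# e.g. "always madvise [never]". Paths are the standard Linux sysfs interface.
from pathlib import Path

def thp_mode(path: str) -> str:
    text = Path(path).read_text().strip()
    # The active mode is wrapped in brackets.
    return text.split("[")[1].split("]")[0] if "[" in text else text

enabled = thp_mode("/sys/kernel/mm/transparent_hugepage/enabled")
defrag = thp_mode("/sys/kernel/mm/transparent_hugepage/defrag")
print(f"THP enabled={enabled}, defrag={defrag}")
if enabled != "never":
    print("Warning: THP active; latency-sensitive services may see compaction stalls.")
```

Operators of latency-sensitive hosts typically set this to never (or madvise, with explicit madvise() calls for controlled regions) at boot.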
Out Of Memory (OOM) situations occur when committed memory exceeds what physical RAM, swap, and reclaimable pages can cover. If many processes dirty their Copy-On-Write pages simultaneously under memory overcommit, the kernel cannot satisfy allocations and invokes the OOM killer, which selects a victim process based on heuristics and terminates it. This is catastrophic for stateful services. The safeguards are strict memory limits (via cgroups or ulimits), 10% to 20% headroom, and monitoring of committed versus physical memory. Google's Borg scheduler enforces hard limits per task, isolating failures. Amazon and Netflix similarly constrain memory and disable swap for latency-critical tiers to prevent both OOM kills and major-fault latency spikes.
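A minimal sketch of the "committed versus physical" monitoring described above, reading the standard Committed_AS, CommitLimit, MemAvailable, and MemTotal fields from /proc/meminfo; the 20% alert threshold is an assumption chosen to mirror the headroom guidance.

```python
# Compare committed memory against the commit limit and check free headroom
# using /proc/meminfo (values are reported in kB).
def meminfo() -> dict:
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])  # kB
    return info

m = meminfo()
commit_ratio = m["Committed_AS"] / m["CommitLimit"]
headroom = m["MemAvailable"] / m["MemTotal"]
print(f"committed/commit-limit: {commit_ratio:.0%}, available headroom: {headroom:.0%}")
if headroom < 0.20:  # assumed threshold, matching the 10-20% guidance above
    print("Warning: low headroom; risk of reclaim stalls or OOM under load.")
```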
💡 Key Takeaways
• Thrashing happens when working sets exceed RAM. Major fault rates spike, CPU time shifts to kernel page-fault handling, and I/O saturates. Even a 0.1% fault rate destroys throughput. Fix: reduce active processes, add RAM, or shrink working sets.
• Transparent Huge Pages (THP) cause unpredictable stalls. Background compaction for 2 MB pages can block allocations for 5 to 20 milliseconds, and TLB shootdowns add microsecond-to-millisecond pauses. The resulting p99 latency spikes violate SLOs.
• Meta and Netflix disable THP for latency-sensitive services (MySQL, RocksDB). They use explicit huge pages only for static, controlled regions like buffer pools to avoid compaction-induced pauses.
• The Out Of Memory (OOM) killer terminates a victim process when committed memory exceeds what physical RAM, swap, and reclaim can satisfy. This is catastrophic for stateful services. Use cgroup limits (a sketch follows this list), maintain 10% to 20% headroom, and monitor committed versus physical memory.
• TLB shootdowns occur when page table entries change. The CPU sends inter-processor interrupts to invalidate TLB entries on all cores, and frequent mapping changes (such as THP promotion or rapid fork/exit) cause pauses.
• Google Borg enforces per-task memory limits below node capacity, isolating OOM kills to individual tasks. Amazon's latency-sensitive services disable swap to prevent major faults from ruining tail latency.
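Below is a minimal sketch of the per-task limit enforcement the takeaways mention, using the cgroup v2 memory controller. It assumes a unified hierarchy mounted at /sys/fs/cgroup and root privileges; the group name "payments" and the 8 GiB limit are hypothetical.

```python
# Create a cgroup-v2 group with a hard memory limit and place a process in it.
import os
from pathlib import Path

cg = Path("/sys/fs/cgroup/payments")   # hypothetical group name
cg.mkdir(exist_ok=True)

limit = 8 * 1024**3                    # hypothetical 8 GiB hard cap
(cg / "memory.max").write_text(str(limit))                # hard limit: OOM kill stays inside the group
(cg / "memory.high").write_text(str(int(limit * 0.9)))    # reclaim/throttle before hitting the hard cap
(cg / "memory.swap.max").write_text("0")                  # disable swap for this latency-critical group

(cg / "cgroup.procs").write_text(str(os.getpid()))        # move the current process into the group

# memory.events exposes an "oom_kill" counter, useful for alerting on per-task kills.
print((cg / "memory.events").read_text())
```

The memory.high watermark triggers reclaim and throttling before the memory.max hard cap, so a misbehaving task degrades in isolation instead of triggering a node-wide OOM.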
📌 Examples
A server runs 10 processes, each with a 5 GB working set, on a machine with 64 GB of RAM. The working sets fit. Adding 5 more processes pushes the combined working sets to 75 GB. The system thrashes: the major fault rate jumps to 1,000 per second, the CPU spends 80% of its time in kernel mode, and throughput drops 90%.
A MySQL instance on a 128 GB server has THP enabled. During peak traffic, THP compaction stalls an allocation for 15 milliseconds. A query waiting on that allocation sees its p99 latency spike from 3 ms to 18 ms, breaking the 10 ms SLO. Meta disables THP and uses explicit 2 MB huge pages for the InnoDB buffer pool, eliminating the stalls.
A microservice under load allocates memory rapidly until its usage exceeds the container's cgroup limit. The OOM killer selects and kills the process, causing 30 seconds of downtime. Kubernetes restarts the pod, but in-flight transactions are lost.