NUMA Memory Locality and Cross-Socket Access Costs
Non-Uniform Memory Access (NUMA) systems have multiple CPU sockets, each with its own local memory. A CPU can access local memory quickly but must traverse an inter-socket link (such as Intel Ultra Path Interconnect or AMD Infinity Fabric) to reach remote memory attached to another socket. Remote accesses incur roughly 1.3x to 2.0x the latency of local accesses and consume inter-socket bandwidth, reducing throughput.
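The node count and relative access costs can be queried at runtime. Below is a minimal sketch using libnuma (link with -lnuma); numa_distance() reports the firmware's relative cost matrix (ACPI SLIT), where 10 means local and remote nodes report proportionally larger values:

```c
/* Sketch: print NUMA topology and relative access distances. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();
    printf("configured NUMA nodes: %d\n", nodes);
    /* Distance 10 = local; e.g., 21 would mean ~2.1x the local cost. */
    for (int from = 0; from < nodes; from++)
        for (int to = 0; to < nodes; to++)
            printf("node%d -> node%d distance: %d\n",
                   from, to, numa_distance(from, to));
    return 0;
}
```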
The first-touch policy governs NUMA placement. When a virtual page is first accessed (faulted in), the kernel allocates its physical frame from the NUMA node local to the CPU that caused the fault. If a thread later migrates to another socket, its memory remains on the original node, turning subsequent accesses into remote ones. For optimal performance, threads should initialize their own data structures and remain pinned to their socket.
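A minimal first-touch sketch, assuming Linux with glibc and pthreads; the CPU numbers are hypothetical placeholders for one core per socket. The key ordering is pin first, touch second:

```c
/* Each worker pins itself to a CPU, then initializes its own buffer:
 * the memset is the "first touch" that faults pages in on that CPU's
 * local NUMA node. malloc alone only reserves virtual pages. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (256UL * 1024 * 1024)   /* 256 MB per worker */

struct worker { int cpu; char *buf; };

static void *worker_main(void *arg) {
    struct worker *w = arg;

    /* Pin this thread BEFORE touching memory, so the first
     * fault lands on the node local to this CPU. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(w->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    w->buf = malloc(BUF_SIZE);
    memset(w->buf, 0, BUF_SIZE);   /* first touch: frames placed locally */

    /* ... worker operates on node-local memory here ... */
    free(w->buf);
    return NULL;
}

int main(void) {
    /* Hypothetical layout: CPU 0 on socket 0, CPU 1 on socket 1. */
    struct worker workers[2] = { { .cpu = 0 }, { .cpu = 1 } };
    pthread_t tids[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&tids[i], NULL, worker_main, &workers[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```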
Production systems enforce NUMA locality to avoid performance collapse. A database server with two sockets might partition data by key range, allocating shard A's buffer pool on node 0 and shard B's on node 1, with worker threads pinned accordingly. If locality is violated and 50% of accesses become remote, effective memory latency rises from 80 nanoseconds to 120 nanoseconds (a 50/50 mix of 80 ns local and 160 ns remote averages 120 ns, or 1.5x), and aggregate throughput can drop by 20% to 30% due to bandwidth contention and higher latency. Monitoring tools track local versus remote memory accesses; once remote accesses exceed 10% to 20%, the throughput loss becomes measurable.
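One way to watch placement is the kernel's per-node counters. The sketch below (assuming Linux and a two-node machine) dumps /sys/devices/system/node/node<N>/numastat; note these count page allocations, not individual accesses, so per-access local/remote ratios instead come from hardware counters such as perf's node-loads and node-load-misses events:

```c
/* Sketch: dump per-node allocation counters. High numa_miss or
 * other_node relative to numa_hit/local_node signals lost locality. */
#include <stdio.h>

#define MAX_NODES 2   /* assumption: a two-socket machine */

int main(void) {
    for (int node = 0; node < MAX_NODES; node++) {
        char path[64];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/numastat", node);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;   /* node not present */
        printf("node%d:\n", node);
        char name[32];
        unsigned long long val;
        while (fscanf(f, "%31s %llu", name, &val) == 2)
            printf("  %-15s %llu\n", name, val);
        fclose(f);
    }
    return 0;
}
```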
Improving NUMA locality requires co-locating threads and data. Use CPU affinity to pin threads to specific cores. Initialize data structures on the same thread (and thus the same NUMA node) that will access them. For shared data, replicate per-node copies or use interleaved allocation as a fallback, trading bandwidth for lower worst-case latency. Large-memory database instances at Google and Amazon carefully partition data and pin threads to avoid cross-socket traffic.
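A sketch of explicit placement with libnuma (link with -lnuma); the shard size and node numbers are illustrative, not a prescribed layout:

```c
/* Bind a shard's memory and its worker to the same node, with
 * interleaving as the fallback for data every socket reads. */
#include <numa.h>
#include <stdlib.h>

#define SHARD_SIZE (1UL << 30)   /* 1 GB per shard, illustrative */

int main(void) {
    if (numa_available() < 0)
        return 1;   /* kernel has no NUMA support */

    /* Shard A: bind both the buffer pool and this thread's
     * execution to node 0, so every access stays local. */
    void *shard_a = numa_alloc_onnode(SHARD_SIZE, 0);
    numa_run_on_node(0);
    /* ... shard A's workload runs here ... */

    /* Fallback for shared data: interleave pages round-robin
     * across nodes, trading average latency for balanced
     * bandwidth and bounded worst-case latency. */
    void *shared = numa_alloc_interleaved(SHARD_SIZE);
    /* ... shared read path ... */

    numa_free(shard_a, SHARD_SIZE);
    numa_free(shared, SHARD_SIZE);
    return 0;
}
```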
💡 Key Takeaways
•NUMA systems have per-socket local memory. Local access costs 60 to 80 nanoseconds. Remote cross-socket access costs 100 to 160 nanoseconds (1.3x to 2.0x slower) and consumes inter-socket bandwidth.
•The first-touch policy allocates physical frames on the NUMA node of the CPU that first accesses (faults in) the page. Thread migration to another socket causes remote accesses.
•Remote memory accesses reduce throughput. If 50% of accesses are remote, effective latency increases by 30% to 50%, and system throughput can drop 20% to 30% due to bandwidth contention.
•Pin threads to CPUs using affinity. Initialize data on the same thread (and NUMA node) that will access it. This ensures first touch places memory locally.
•Monitor local versus remote memory access ratios using performance counters. If remote accesses exceed 10% to 20%, expect measurable performance loss.
•For shared data, replicate per-node copies or use interleaved allocation (see the sketch after this list). Replication consumes more memory but avoids remote accesses. Interleaving spreads pages across nodes, reducing worst-case latency at the cost of average throughput.
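A sketch of the per-node replication pattern, assuming libnuma (-lnuma) and a read-mostly table; replicate_table and local_table are hypothetical helpers, not a standard API:

```c
/* One copy of a read-mostly table per node; each reader uses
 * the copy local to the node it is currently running on. */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <string.h>

#define TABLE_SIZE (64UL * 1024 * 1024)   /* 64 MB, illustrative */

static void *replicas[64];   /* replicas[n] = node n's local copy */

void replicate_table(const void *master, int nodes) {
    for (int n = 0; n < nodes; n++) {
        /* numa_alloc_onnode binds the pages to node n, so the copy
         * lands there no matter which CPU performs the memcpy. */
        replicas[n] = numa_alloc_onnode(TABLE_SIZE, n);
        memcpy(replicas[n], master, TABLE_SIZE);
    }
}

/* Readers fetch the replica local to whatever node they run on. */
const void *local_table(void) {
    int node = numa_node_of_cpu(sched_getcpu());
    return replicas[node < 0 ? 0 : node];
}
```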
📌 Examples
A database server with two sockets partitions shards by key range. Shard A (keys 0 to 1M) is allocated on socket 0 with worker threads pinned to socket 0 cores; shard B (keys 1M to 2M) lives on socket 1. This keeps 95%+ of accesses local, maximizing throughput.
A misconfigured application allocates a 64 GB working set on socket 0 but runs worker threads on socket 1. All accesses are remote, doubling memory latency from 80 ns to 160 ns and cutting throughput by 35%.
Google's large-memory instances use numactl (e.g., numactl --interleave=all) to interleave allocation for shared read-only datasets (like embedding tables), spreading pages across sockets. This trades 10% to 15% throughput for balanced load and better worst-case latency.