NUMA Memory Locality and Cross-Socket Access Costs
Non-Uniform Memory Access (NUMA) systems have multiple CPU sockets, each with its own local memory. A CPU can access local memory quickly but must traverse an inter-socket link (such as Intel Ultra Path Interconnect or AMD Infinity Fabric) to reach remote memory attached to another socket. Remote accesses incur roughly 1.3x to 2.0x the latency of local accesses and consume inter-socket bandwidth, reducing throughput.
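The node count and relative access costs can be queried at runtime. Below is a minimal sketch using libnuma (link with -lnuma); numa_distance() reports the firmware's relative cost matrix (ACPI SLIT), where 10 means local and remote nodes report proportionally larger values:

```c
/* Sketch: print NUMA topology and relative access distances. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();
    printf("configured NUMA nodes: %d\n", nodes);
    /* Distance 10 = local; e.g., 21 would mean ~2.1x the local cost. */
    for (int from = 0; from < nodes; from++)
        for (int to = 0; to < nodes; to++)
            printf("node%d -> node%d distance: %d\n",
                   from, to, numa_distance(from, to));
    return 0;
}
```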
The first-touch policy governs NUMA placement. When a virtual page is first accessed (faulted in), the kernel allocates its physical frame from the NUMA node local to the CPU that caused the fault. If a thread later migrates to another socket, its memory remains on the original node, turning subsequent accesses into remote ones. For optimal performance, threads should initialize their own data structures and remain pinned to their socket.
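A minimal first-touch sketch, assuming Linux with glibc and pthreads; the CPU numbers are hypothetical placeholders for one core per socket. The key ordering is pin first, touch second:

```c
/* Each worker pins itself to a CPU, then initializes its own buffer:
 * the memset is the "first touch" that faults pages in on that CPU's
 * local NUMA node. malloc alone only reserves virtual pages. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (256UL * 1024 * 1024)   /* 256 MB per worker */

struct worker { int cpu; char *buf; };

static void *worker_main(void *arg) {
    struct worker *w = arg;

    /* Pin this thread BEFORE touching memory, so the first
     * fault lands on the node local to this CPU. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(w->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    w->buf = malloc(BUF_SIZE);
    memset(w->buf, 0, BUF_SIZE);   /* first touch: frames placed locally */

    /* ... worker operates on node-local memory here ... */
    free(w->buf);
    return NULL;
}

int main(void) {
    /* Hypothetical layout: CPU 0 on socket 0, CPU 1 on socket 1. */
    struct worker workers[2] = { { .cpu = 0 }, { .cpu = 1 } };
    pthread_t tids[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&tids[i], NULL, worker_main, &workers[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```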
Production systems enforce NUMA locality to avoid performance collapse. A database server with two sockets might partition data by key range, allocating shard A's buffer pool on node 0 and shard B's on node 1, with worker threads pinned accordingly. If locality is violated and 50% of accesses become remote, effective memory latency rises from 80 nanoseconds to 120 nanoseconds (a 50/50 mix of 80 ns local and 160 ns remote averages 120 ns, or 1.5x), and aggregate throughput can drop by 20% to 30% due to bandwidth contention and higher latency. Monitoring tools track local versus remote memory accesses; once remote accesses exceed 10% to 20%, the throughput loss becomes measurable.
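One way to watch placement is the kernel's per-node counters. The sketch below (assuming Linux and a two-node machine) dumps /sys/devices/system/node/node<N>/numastat; note these count page allocations, not individual accesses, so per-access local/remote ratios instead come from hardware counters such as perf's node-loads and node-load-misses events:

```c
/* Sketch: dump per-node allocation counters. High numa_miss or
 * other_node relative to numa_hit/local_node signals lost locality. */
#include <stdio.h>

#define MAX_NODES 2   /* assumption: a two-socket machine */

int main(void) {
    for (int node = 0; node < MAX_NODES; node++) {
        char path[64];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/numastat", node);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;   /* node not present */
        printf("node%d:\n", node);
        char name[32];
        unsigned long long val;
        while (fscanf(f, "%31s %llu", name, &val) == 2)
            printf("  %-15s %llu\n", name, val);
        fclose(f);
    }
    return 0;
}
```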
Improving NUMA locality requires co-locating threads and data. Use CPU affinity to pin threads to specific cores. Initialize data structures on the same thread (and thus the same NUMA node) that will access them. For shared data, replicate per-node copies or use interleaved allocation as a fallback, trading bandwidth for lower worst-case latency. Large-memory database instances at Google and Amazon carefully partition data and pin threads to avoid cross-socket traffic.
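A sketch of explicit placement with libnuma (link with -lnuma); the shard size and node numbers are illustrative, not a prescribed layout:

```c
/* Bind a shard's memory and its worker to the same node, with
 * interleaving as the fallback for data every socket reads. */
#include <numa.h>
#include <stdlib.h>

#define SHARD_SIZE (1UL << 30)   /* 1 GB per shard, illustrative */

int main(void) {
    if (numa_available() < 0)
        return 1;   /* kernel has no NUMA support */

    /* Shard A: bind both the buffer pool and this thread's
     * execution to node 0, so every access stays local. */
    void *shard_a = numa_alloc_onnode(SHARD_SIZE, 0);
    numa_run_on_node(0);
    /* ... shard A's workload runs here ... */

    /* Fallback for shared data: interleave pages round-robin
     * across nodes, trading average latency for balanced
     * bandwidth and bounded worst-case latency. */
    void *shared = numa_alloc_interleaved(SHARD_SIZE);
    /* ... shared read path ... */

    numa_free(shard_a, SHARD_SIZE);
    numa_free(shared, SHARD_SIZE);
    return 0;
}
```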
💡 Key Takeaways
•NUMA systems have per-socket local memory. Local access costs 60 to 80 nanoseconds. Remote cross-socket access costs 100 to 160 nanoseconds (1.3x to 2.0x slower) and consumes inter-socket bandwidth.
•The first-touch policy allocates physical frames on the NUMA node of the CPU that first accesses (faults in) the page. Thread migration to another socket causes remote accesses.
•Remote memory accesses reduce throughput. If 50% of accesses are remote, effective latency increases by 30% to 50%, and system throughput can drop 20% to 30% due to bandwidth contention.
•Pin threads to CPUs using affinity. Initialize data on the same thread (and NUMA node) that will access it. This ensures first touch places memory locally.
•Monitor local versus remote memory access ratios using performance counters. If remote accesses exceed 10% to 20%, expect measurable performance loss.
•For shared data, replicate per-node copies or use interleaved allocation (see the sketch after this list). Replication consumes more memory but avoids remote accesses. Interleaving spreads pages across nodes, reducing worst-case latency at the cost of average throughput.
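A sketch of the per-node replication pattern, assuming libnuma (-lnuma) and a read-mostly table; replicate_table and local_table are hypothetical helpers, not a standard API:

```c
/* One copy of a read-mostly table per node; each reader uses
 * the copy local to the node it is currently running on. */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <string.h>

#define TABLE_SIZE (64UL * 1024 * 1024)   /* 64 MB, illustrative */

static void *replicas[64];   /* replicas[n] = node n's local copy */

void replicate_table(const void *master, int nodes) {
    for (int n = 0; n < nodes; n++) {
        /* numa_alloc_onnode binds the pages to node n, so the copy
         * lands there no matter which CPU performs the memcpy. */
        replicas[n] = numa_alloc_onnode(TABLE_SIZE, n);
        memcpy(replicas[n], master, TABLE_SIZE);
    }
}

/* Readers fetch the replica local to whatever node they run on. */
const void *local_table(void) {
    int node = numa_node_of_cpu(sched_getcpu());
    return replicas[node < 0 ? 0 : node];
}
```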
📌 Examples
A database server with two sockets partitions shards by key range. Shard A (keys 0 to 1M) is allocated on socket 0 with worker threads pinned to socket 0 cores; shard B (keys 1M to 2M) lives on socket 1. This keeps 95%+ of accesses local, maximizing throughput.
A misconfigured application allocates a 64 GB working set on socket 0 but runs worker threads on socket 1. All accesses are remote, doubling memory latency from 80 ns to 160 ns and cutting throughput by 35%.
Google's large-memory instances use numactl (e.g., numactl --interleave=all) to interleave allocation for shared read-only datasets (like embedding tables), spreading pages across sockets. This trades 10% to 15% throughput for balanced load and better worst-case latency.