
Linux CFS Scheduler and Production Impact

The Linux Completely Fair Scheduler (CFS) is the default CPU scheduler; it aims to give each runnable thread an equal share of CPU time, weighted by priority. CFS tracks a virtual runtime for each thread (essentially how much CPU time it has consumed, adjusted by its nice value) and keeps runnable threads in a per-core red-black tree ordered by virtual runtime. Insertions cost O(log N), and the scheduler always picks the thread with the smallest virtual runtime, the tree's leftmost node, which the kernel caches for fast access.

The key parameters that affect production systems are target latency (default 6 milliseconds) and minimum granularity (default 0.75 milliseconds). With 8 runnable threads on a core, each gets roughly 0.75 millisecond time slices. With 1000 runnable threads, each still gets 0.75 millisecond slices, because the granularity acts as a floor. Adding runnable threads therefore directly increases context switch frequency and cache churn without improving fairness.

In practice, systems running hundreds of runnable threads per core see noticeable scheduler overhead: the scheduler itself consumes several percent of CPU time managing runqueues and load balancing. More critically, frequent 0.75 millisecond preemptions thrash instruction and data caches. A thread that gets preempted loses its hot cache lines, and when it resumes, it suffers cache misses for the next few microseconds. Across thousands of switches per second, this cache pollution inflates p99 latencies by tens of milliseconds in latency-sensitive services.

CFS also implements load balancing across cores every few milliseconds, migrating threads from busy cores to idle ones. While this improves overall utilization, migrations incur the highest context switch costs because the thread's cache affinity is completely broken: cross-core migrations trigger Translation Lookaside Buffer (TLB) shootdowns via inter-processor interrupts, and the migrated thread must rebuild its working set in the new core's cache hierarchy.
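A minimal sketch of the slice computation described above, using the default values quoted in this article (real CFS also weights slices by each thread's nice value, which this model ignores):

```python
# Simplified model of CFS time-slice sizing for nice-0 threads.
TARGET_LATENCY = 6e-3      # seconds: default target latency cited above
MIN_GRANULARITY = 0.75e-3  # seconds: default minimum granularity (the floor)

def time_slice(runnable_threads: int) -> float:
    """Each thread's slice is the target latency split evenly across
    runnable threads, but never below the minimum granularity."""
    return max(TARGET_LATENCY / runnable_threads, MIN_GRANULARITY)

def switches_per_second(runnable_threads: int) -> float:
    """One context switch per expiring slice on a fully busy core."""
    return 1.0 / time_slice(runnable_threads)

for n in (2, 8, 1000):
    print(f"{n:>4} runnable: slice {time_slice(n) * 1e3:.2f} ms, "
          f"{switches_per_second(n):.0f} switches/s")
```

Note that 8 and 1000 runnable threads produce the same 0.75 ms slice: once the floor is hit, oversubscription buys no additional fairness, only more switching.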
💡 Key Takeaways
CFS uses target latency of 6 milliseconds and minimum granularity of 0.75 milliseconds to determine time slices per thread
With 1000 runnable threads per core, each gets 0.75 millisecond slices, causing over 1300 context switches per second per core and severe cache thrashing
Scheduler overhead becomes measurable at hundreds of runnable threads: several percent of CPU time spent in runqueue management and load balancing code paths
Cross-core migrations from load balancing trigger TLB shootdowns and cache misses, adding 5 to 15 microseconds per migration plus a cache warmup penalty
Systems should target fewer than 10 to 20 runnable threads per core for latency-sensitive workloads to keep context switch overhead below 1 percent of CPU time
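The 1 percent budget in the last takeaway can be sanity-checked with back-of-envelope arithmetic; the per-switch costs below (3 and 10 microseconds) are illustrative assumptions, not figures from this article:

```python
# Back-of-envelope estimate: fraction of CPU time lost to context switching.
MIN_GRANULARITY = 0.75e-3  # seconds: the CFS slice floor cited above

def overhead_fraction(switches_per_sec: float, cost_per_switch: float) -> float:
    """CPU fraction consumed by the switches themselves."""
    return switches_per_sec * cost_per_switch

# Oversubscribed core: slices pinned at the 0.75 ms floor -> ~1333 switches/s.
busy = 1.0 / MIN_GRANULARITY

direct_cost = 3e-6   # assumed ~3 us of direct switch work (register/kernel state)
print(f"{overhead_fraction(busy, direct_cost):.2%}")  # -> 0.40%

# Once cache-warmup penalties are folded in, an effective ~10 us per switch
# already blows past the 1 percent budget.
print(f"{overhead_fraction(busy, 10e-6):.2%}")  # -> 1.33%
```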
📌 Examples
A Java service with 200 application threads per instance on an 8-core machine has 25 runnable threads per core on average, pinning slices at the 0.75 millisecond floor, causing roughly 1300 context switches per second per core and adding 5 to 10 milliseconds to p99 latency
NGINX uses one worker thread per core, handling 10,000 to 100,000 concurrent connections with minimal context switches, keeping scheduler overhead under 0.1 percent
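Both examples follow from the same slice model. A quick check, with thread counts taken from the examples above and the simplifying assumption that every application thread is always runnable:

```python
# Per-core context switch rate implied by the CFS defaults cited in this article.
TARGET_LATENCY = 6e-3      # seconds
MIN_GRANULARITY = 0.75e-3  # seconds

def switch_rate_per_core(app_threads: int, cores: int) -> float:
    """Context switches per second on one fully busy core, assuming
    threads are evenly spread and always runnable."""
    runnable = app_threads / cores
    if runnable <= 1:
        return 0.0  # a lone runnable thread is never preempted for fairness
    slice_s = max(TARGET_LATENCY / runnable, MIN_GRANULARITY)
    return 1.0 / slice_s

# Java example: 200 threads on 8 cores -> 25 runnable/core, slices at the floor.
print(switch_rate_per_core(200, 8))  # ~1333 switches/s per core

# NGINX example: one worker per core -> no fairness preemption at all.
print(switch_rate_per_core(8, 8))
```

The NGINX case is the limit the last takeaway points toward: with one runnable thread per core, the scheduler has nothing to time-slice, so overhead falls to the cost of occasional wakeups rather than continuous preemption.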