
Kubernetes CPU Throttling and CFS Bandwidth Control

Kubernetes enforces CPU limits using the Linux CFS bandwidth control mechanism. When you set a CPU limit on a container, Kubernetes configures a quota and a period. The default period is 100 milliseconds, and the quota is the limit multiplied by the period: a container with a limit of 2 CPUs gets a quota of 200 milliseconds of CPU time per 100-millisecond period. Once the container consumes its quota, it is throttled (not scheduled) until the next period starts.

This creates a hard performance cliff. If your service experiences a burst of traffic and tries to use 2.1 CPUs, it will exhaust its quota after roughly 95 milliseconds and then be completely throttled for the remaining 5 milliseconds of the period. This manifests as p99 latency spikes of 5 to 100 milliseconds depending on when in the period the throttling occurs. Many production teams have observed that CPU throttling is the single largest contributor to tail latency in containerized services.

The problem is worse than it appears because quota accounting happens per CPU. With multiple threads running across cores, a container can exhaust its quota even when its average utilization is below the limit, due to scheduling artifacts and burstiness.

Google Kubernetes Engine (GKE) guidance explicitly recommends avoiding CPU limits for latency-critical workloads, or reducing the period from 100 milliseconds to 1 millisecond so that the worst-case throttle duration is capped at 1 millisecond instead of 100. The tradeoff is more frequent quota accounting overhead and timer interrupts. In practice, many organizations run latency-sensitive services with CPU requests but no limits, relying on overprovisioning and pod density limits to prevent resource contention. For batch workloads where throughput matters more than latency, limits with the default 100-millisecond period are acceptable. For services with Service Level Objectives (SLOs) on p99 latency, either remove limits entirely or set them 50 to 100 percent above typical peak usage to provide headroom for bursts.
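You can confirm what the runtime actually configured by reading the cgroup files from inside the container. Below is a minimal sketch assuming cgroup v2 with the unified hierarchy mounted at /sys/fs/cgroup; on cgroup v1 the equivalent files are cpu.cfs_quota_us, cpu.cfs_period_us, and a cpu.stat whose throttled_time is reported in nanoseconds.

```python
"""Inspect CFS bandwidth settings and throttling from inside a container.

A sketch assuming cgroup v2 (unified hierarchy) mounted at /sys/fs/cgroup.
"""
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup")

def read_bandwidth():
    # cpu.max holds "<quota_us> <period_us>", or "max <period_us>" when unlimited.
    quota, period = (CGROUP / "cpu.max").read_text().split()
    return quota, int(period)

def read_throttle_stats():
    # cpu.stat includes nr_periods, nr_throttled, and throttled_usec
    # when the cpu controller is enabled.
    pairs = (line.split() for line in (CGROUP / "cpu.stat").read_text().splitlines())
    return {key: int(value) for key, value in pairs}

if __name__ == "__main__":
    quota, period = read_bandwidth()
    stats = read_throttle_stats()
    print(f"quota={quota}us per period={period}us")
    if stats.get("nr_periods"):
        pct = 100 * stats["nr_throttled"] / stats["nr_periods"]
        print(f"throttled in {stats['nr_throttled']}/{stats['nr_periods']} periods ({pct:.1f}%)")
        print(f"total throttled time: {stats['throttled_usec'] / 1000:.1f} ms")
```

Watching nr_throttled and throttled_usec grow between samples is a common way to confirm that p99 spikes line up with quota exhaustion rather than application-level pauses.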
💡 Key Takeaways
Default CFS bandwidth period of 100 milliseconds means throttled containers wait up to 100 milliseconds for the next period, causing p99 latency spikes
Many teams report p99 latency improvements of tens of milliseconds after removing CPU limits from latency-sensitive services
Reducing the CFS period from 100 milliseconds to 1 millisecond caps throttle duration at 1 millisecond but increases quota accounting overhead and timer interrupt frequency (see the arithmetic sketch after this list)
Quota is accounted per CPU, so multi-threaded applications can exhaust it even when average utilization is below the limit, due to scheduling artifacts and burstiness across cores
Google's GKE guidance recommends avoiding CPU limits for latency-critical workloads, or setting limits 50 to 100 percent above peak usage to provide burst headroom
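The arithmetic behind these numbers is straightforward; the helper functions below are illustrative names, not part of any Kubernetes or kernel API. Quota is simply limit × period, and the worst-case stall is whatever remains of the period after the quota runs out.

```python
def cfs_quota_us(cpu_limit: float, period_us: int = 100_000) -> int:
    """Quota written for a CPU limit: limit * period, in microseconds."""
    return int(cpu_limit * period_us)

def worst_case_stall_ms(demand_cpus: float, limit_cpus: float,
                        period_us: int = 100_000) -> float:
    """Longest stall per period when demand exceeds the limit.

    Quota lasts (limit / demand) of the period; the remainder is a forced
    stall until the next period refreshes the quota.
    """
    if demand_cpus <= limit_cpus:
        return 0.0
    runnable_us = (limit_cpus / demand_cpus) * period_us
    return (period_us - runnable_us) / 1000.0

if __name__ == "__main__":
    # The article's example: a 2-CPU limit under 2.1 CPUs of demand.
    print(cfs_quota_us(2.0))                     # 200000 us per 100 ms period
    print(worst_case_stall_ms(2.1, 2.0))         # ~4.8 ms stall per period
    # Shrinking the period to 1 ms caps the stall at a fraction of 1 ms.
    print(worst_case_stall_ms(2.1, 2.0, 1_000))  # ~0.048 ms
```

The last line shows why shrinking the period bounds the stall: the quota still limits average usage, but it refreshes 100 times more often. In Kubernetes the period is a kubelet-level setting (the cpuCFSQuotaPeriod configuration field, gated by CustomCPUCFSQuotaPeriod in older versions), not something you set per pod.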
📌 Examples
A service with a 2-CPU limit and 1.8-CPU average usage experiences 10 to 100 millisecond p99 latency spikes during traffic bursts when quota is exhausted mid-period (simulated in the sketch after this list)
Removing CPU limits from a Java API service reduced p99 latency from 85 milliseconds to 45 milliseconds under the same load, eliminating throttling-induced stalls
Setting the CFS period to 1 millisecond reduced worst-case throttle stalls from 100 milliseconds to 1 millisecond, improving p999 latency by 50 milliseconds at the cost of about 0.5 percent CPU overhead for accounting
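To make the first example concrete, here is a toy per-period simulation; the demand distribution is invented purely for illustration and the model is not kernel-accurate. It shows how a workload whose mean demand is below the limit still stalls in exactly those periods where it bursts.

```python
import random

def simulate_burst_throttling(periods: int = 10_000, limit_cpus: float = 2.0,
                              period_ms: float = 100.0, seed: int = 42) -> None:
    """Toy per-period CFS bandwidth model.

    Per-period demand is drawn from an invented bursty distribution whose
    mean sits below the limit; any period where demand exceeds the limit
    ends in a stall once the quota runs out.
    """
    rng = random.Random(seed)
    demands = [rng.choice([1.2, 1.5, 1.6, 2.6]) for _ in range(periods)]
    stalls = [
        period_ms * (1.0 - limit_cpus / d) if d > limit_cpus else 0.0
        for d in demands
    ]
    throttled = sum(1 for s in stalls if s > 0)
    print(f"mean demand: {sum(demands) / periods:.2f} CPUs (below the {limit_cpus}-CPU limit)")
    print(f"throttled periods: {throttled}/{periods}")
    print(f"worst stall: {max(stalls):.1f} ms")

if __name__ == "__main__":
    simulate_burst_throttling()
```

With these invented parameters the mean demand is about 1.7 CPUs, comfortably under the 2-CPU limit, yet roughly a quarter of periods end in a stall of about 23 milliseconds, which is exactly the kind of below-the-limit tail latency the first example describes.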