Common Context Switching Failure Modes at Scale
Lock Convoy
A lock convoy forms when many threads compete for one lock. A thread acquires the lock, gets preempted while still holding it, and every other thread piles up waiting. When the holder runs again and releases the lock, another thread acquires it, gets preempted in turn, and the cycle repeats.
The result is serialization: even with many cores, only one thread makes progress at a time. Context switches multiply because waiting threads keep waking and blocking again, and throughput can collapse below that of a single-threaded run. Detect convoys by monitoring lock hold times and contention counts.
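As a rough illustration of that kind of monitoring, here is a minimal sketch of a lock wrapper that counts contended acquisitions and records hold times. The names InstrumentedLock and run_demo are hypothetical, and the sleep stands in for real work inside the critical section; a production system would more likely rely on profiler or runtime-provided lock statistics.

```python
import threading
import time


class InstrumentedLock:
    """Hypothetical wrapper around threading.Lock that records contention and hold times."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0    # total successful acquisitions
        self.contended = 0       # acquisitions that had to wait for another holder
        self.total_hold_s = 0.0  # summed time the lock was held
        self.max_hold_s = 0.0    # longest single hold
        self._acquired_at = 0.0

    def __enter__(self):
        # Try a non-blocking acquire first; failure means someone else holds the lock.
        was_contended = not self._lock.acquire(blocking=False)
        if was_contended:
            self._lock.acquire()
        # Statistics are updated only while holding the lock, so they need no extra guard.
        self._acquired_at = time.perf_counter()
        self.acquisitions += 1
        if was_contended:
            self.contended += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        held = time.perf_counter() - self._acquired_at
        self.total_hold_s += held
        self.max_hold_s = max(self.max_hold_s, held)
        self._lock.release()
        return False


def run_demo(num_threads=8, iterations=50):
    """Hammer one lock from several threads and return (lock, final counter)."""
    lock = InstrumentedLock()
    counter = [0]

    def worker():
        for _ in range(iterations):
            with lock:
                counter[0] += 1
                time.sleep(0.0005)  # simulated work inside the critical section

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lock, counter[0]


if __name__ == "__main__":
    lock, total = run_demo()
    print("acquisitions:", lock.acquisitions)
    print("contended:", lock.contended)
    print("mean hold (ms):", 1000 * lock.total_hold_s / lock.acquisitions)
    print("max hold (ms):", 1000 * lock.max_hold_s)
```

A contended fraction close to 1 together with occasional hold times far above the mean is the pattern to look for: it suggests holders are being descheduled inside the critical section.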
Priority Inversion
Priority inversion occurs when a high-priority thread waits for a lock held by a low-priority thread. The low-priority thread cannot run because medium-priority threads preempt it, so it never gets to release the lock. The high-priority thread effectively runs at a lower priority than the medium-priority threads.
One solution is priority inheritance: temporarily boost the lock holder to the priority of its highest-priority waiter. The holder can then no longer be preempted by medium-priority threads, so it finishes its critical section and releases the lock quickly. Alternatively, use lock-free data structures or redesign to eliminate the problematic lock.
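The effect of priority inheritance can be seen in a toy model. The sketch below is a tick-based simulation of a single core with strict priority scheduling, not a real scheduler: the Task fields, the three-task scenario, and the simulate function are all assumptions made for illustration. Real systems get inheritance from the kernel or threading library (for example, POSIX mutexes configured with the PTHREAD_PRIO_INHERIT protocol).

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int     # larger number = higher priority
    arrival: int      # tick at which the task becomes ready
    work: int         # ticks of CPU time still needed
    needs_lock: bool  # True if all of its work runs inside the shared lock
    done_at: int = -1


def simulate(inherit: bool) -> dict:
    """Run the hypothetical scenario on one core; return finish tick per task."""
    tasks = [
        Task("low", 1, 0, 3, True),
        Task("high", 3, 1, 2, True),
        Task("medium", 2, 2, 5, False),
    ]
    holder = None  # task currently holding the lock, if any
    tick = 0
    while any(t.work > 0 for t in tasks):
        ready = [t for t in tasks if t.arrival <= tick and t.work > 0]

        def effective(t):
            # With inheritance, the holder runs at the priority of its
            # highest-priority waiter (if that is higher than its own).
            if inherit and t is holder:
                waiting = [w.priority for w in ready
                           if w.needs_lock and w is not holder]
                return max([t.priority] + waiting)
            return t.priority

        # A task that needs the lock is blocked while another task holds it.
        runnable = [t for t in ready
                    if not (t.needs_lock and holder is not None and holder is not t)]
        if not runnable:
            tick += 1
            continue

        current = max(runnable, key=effective)
        if current.needs_lock:
            holder = current
        current.work -= 1
        tick += 1
        if current.work == 0:
            current.done_at = tick
            if holder is current:
                holder = None  # release the lock when the critical section ends
    return {t.name: t.done_at for t in tasks}


if __name__ == "__main__":
    print("without inheritance:", simulate(inherit=False))
    print("with inheritance:   ", simulate(inherit=True))
```

In this scenario the high-priority task finishes at tick 10 without inheritance, because the medium-priority task runs to completion first, and at tick 5 with inheritance, because the boosted lock holder finishes its critical section before the medium-priority task gets the core.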
Runaway Thread Count
Too many runnable threads overwhelm the scheduler. With 1000 threads on 32 cores, roughly 31 threads compete for each core, so each thread gets only a tiny time slice. Context switches dominate, cache thrashing destroys performance, and at very high thread counts the scheduler itself becomes a bottleneck.
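To put rough numbers on this, the sketch below assumes a simplified fair scheduler that tries to run every runnable thread on a core once per fixed period. The 24 ms period is a hypothetical value chosen for illustration, not a figure from any particular kernel, and real schedulers also enforce a minimum slice length.

```python
def per_thread_slice_ms(runnable_threads: int, cores: int, period_ms: float):
    """Return (threads per core, slice per thread in ms) under an even split.

    Assumes every core carries an equal share of runnable threads and that
    the scheduling period is divided evenly among them.
    """
    threads_per_core = runnable_threads / cores
    slice_ms = period_ms / max(threads_per_core, 1.0)
    return threads_per_core, slice_ms


if __name__ == "__main__":
    per_core, slice_ms = per_thread_slice_ms(1000, 32, period_ms=24.0)
    print(f"{per_core:.2f} threads per core, {slice_ms:.3f} ms per slice")
```

Under these assumptions each thread runs for well under a millisecond per turn, which leaves little time to rebuild cache state before the next switch.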
Symptoms include a high context switch rate (tens of thousands per second), high CPU system time, and low actual throughput despite high CPU utilization. The fix is to reduce the thread count to match the core count or to use thread pools with a bounded size.
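A bounded pool sized to the core count might look like the following minimal sketch. It uses Python's standard ThreadPoolExecutor; the helper name bounded_map is hypothetical, and the core-count default is a starting point for CPU-bound work rather than a universal rule (I/O-bound workloads often tolerate more threads).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def bounded_map(func, items, max_workers=None):
    """Apply func to every item using at most max_workers threads.

    Defaults to the number of cores reported by the OS, so queued work
    waits for a free worker instead of spawning one thread per item.
    """
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(func, items))


if __name__ == "__main__":
    squares = bounded_map(lambda x: x * x, range(1000))
    print(len(squares), "results from at most", os.cpu_count() or 1, "threads")
```

Submitting 1000 items here creates 1000 queued tasks but never more than the configured number of threads, which keeps the runnable thread count near the core count.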