Latency, Throughput, and CPU Trade-offs in TCP vs UDP Implementations
The performance characteristics of TCP and UDP extend beyond protocol semantics to implementation and hardware acceleration. TCP benefits from mature kernel implementations with extensive NIC offloads: hardware checksum computation, TCP segmentation offload (TSO), and large receive offload (LRO) reduce CPU cycles per byte. Kernel TCP stacks have been tuned for decades; operators get well-understood congestion control algorithms, established observability via netstat and ss, and predictable resource consumption. For bulk-throughput workloads, kernel TCP can saturate 10 to 100 Gbps links with modest CPU usage when properly tuned.
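To make the offload point concrete, here is a minimal sketch (assuming Linux, a NIC with TSO and checksum offload enabled, and a placeholder destination of 192.0.2.10:9000) of a bulk sender that hands the kernel large writes and lets TSO do the segmentation instead of chunking data in user space.

```c
/* Minimal sketch: bulk TCP sender that relies on kernel TSO/checksum offload.
 * Assumptions: Linux, offloads enabled on the NIC (check with `ethtool -k`),
 * placeholder address 192.0.2.10:9000. Error handling is abbreviated. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* Large send buffer so the kernel can keep the NIC's TSO pipeline full. */
    int sndbuf = 4 * 1024 * 1024;
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9000);
    inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);
    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("connect");
        return 1;
    }

    /* Write in large chunks: the kernel and NIC split these into MSS-sized
     * segments (TSO) and checksum them in hardware, keeping CPU-per-byte low. */
    static char buf[256 * 1024];
    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < 64; i++) {
        ssize_t n = write(fd, buf, sizeof(buf));
        if (n < 0) { perror("write"); break; }
    }
    close(fd);
    return 0;
}
```

Running the same program with TSO disabled (ethtool -K eth0 tso off) on a typical Linux host is one way to measure how much CPU per byte the offload saves on a given NIC.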
UDP-based user-space transports like QUIC trade kernel efficiency for control and latency optimization. Moving transport logic to user space incurs higher CPU cost per packet and per byte: early QUIC implementations required 50 to 100% more CPU than kernel TCP plus TLS for equivalent throughput. Modern QUIC stacks have narrowed this gap through batching, optimized crypto libraries, and careful timer coalescing, but operators still plan for additional CPU headroom. The benefit is fine-grained control: custom congestion control, per-stream prioritization, selective retransmission with deadlines, and forward error correction tuned to observed loss patterns. On lossy mobile networks with 1 to 3% loss, this control translates into tail-latency improvements of 50 to 200 ms that justify the CPU cost.
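The "selective retransmission with deadlines" idea fits in a few lines. The sketch below is illustrative only, with hypothetical names (lost_packet, should_retransmit), not the API of any real QUIC stack: a loss is only worth repairing if one more RTT still beats the data's usefulness deadline.

```c
/* Illustrative sketch (not any particular QUIC stack): deadline-aware
 * retransmission decision. A lost packet is retransmitted only if the data
 * can still arrive before its application deadline. Names are hypothetical. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t sent_at_ms;   /* when the packet was first sent             */
    uint64_t deadline_ms;  /* latest useful delivery time (absolute)     */
    uint32_t stream_id;    /* stream the data belongs to                 */
} lost_packet;

/* Retransmit only if one more RTT still beats the deadline; otherwise the
 * receiver would discard the stale data anyway (e.g., an old game state). */
bool should_retransmit(const lost_packet *p, uint64_t now_ms, uint64_t srtt_ms) {
    return now_ms + srtt_ms < p->deadline_ms;
}

int main(void) {
    lost_packet p = { .sent_at_ms = 1000, .deadline_ms = 1150, .stream_id = 7 };
    uint64_t srtt_ms = 80;
    printf("at t=1050: retransmit? %s\n",
           should_retransmit(&p, 1050, srtt_ms) ? "yes" : "no");  /* 1130 < 1150: yes */
    printf("at t=1100: retransmit? %s\n",
           should_retransmit(&p, 1100, srtt_ms) ? "yes" : "no");  /* 1180 >= 1150: no */
    return 0;
}
```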
Packet rate becomes a bottleneck with small UDP messages. Gaming and IoT workloads sending 50 to 300 byte packets at high frequency generate more NIC interrupts and context switches than bulk TCP flows. Systems must implement batching, consider kernel-bypass techniques like DPDK, and tune interrupt coalescing. Capacity planning for UDP-based services should budget for higher packet-processing overhead: measure CPU per million packets, not just per Gbps. On mobile clients, user-space crypto and frequent NAT keepalives increase battery drain, requiring careful tuning of keepalive intervals (typically 30 to 120 seconds) to balance NAT binding lifetime against power consumption.
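One common batching technique on Linux is sendmmsg(), which submits many small datagrams in a single system call. The sketch below uses a placeholder destination (192.0.2.20:4000) and dummy 200-byte payloads; it is a minimal illustration, not a production sender.

```c
/* Minimal sketch, Linux-specific: batch many small UDP datagrams into one
 * sendmmsg() system call to amortize per-packet syscall overhead.
 * Placeholder destination 192.0.2.20:4000; payload contents are dummy. */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define BATCH   32
#define PAYLOAD 200   /* typical small game/IoT update size */

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(4000);
    inet_pton(AF_INET, "192.0.2.20", &dst.sin_addr);

    static char payloads[BATCH][PAYLOAD];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];
    memset(msgs, 0, sizeof(msgs));

    for (int i = 0; i < BATCH; i++) {
        memset(payloads[i], i, PAYLOAD);
        iov[i].iov_base = payloads[i];
        iov[i].iov_len  = PAYLOAD;
        msgs[i].msg_hdr.msg_name    = &dst;
        msgs[i].msg_hdr.msg_namelen = sizeof(dst);
        msgs[i].msg_hdr.msg_iov     = &iov[i];
        msgs[i].msg_hdr.msg_iovlen  = 1;
    }

    /* One syscall sends up to BATCH datagrams instead of BATCH sendto() calls. */
    int sent = sendmmsg(fd, msgs, BATCH, 0);
    if (sent < 0) perror("sendmmsg");
    else printf("sent %d datagrams in one syscall\n", sent);

    close(fd);
    return 0;
}
```

Pairing this with recvmmsg() on the receive path, and with UDP GSO (the UDP_SEGMENT socket option) where the kernel supports it, further reduces per-packet syscall cost.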
💡 Key Takeaways
• Kernel TCP with NIC offloads can saturate 10 to 100 Gbps links with lower CPU than user-space UDP transports; early QUIC stacks required 50 to 100% more CPU for equivalent throughput
• Modern QUIC implementations have narrowed the CPU gap through batching and optimized crypto, but operators still plan for double-digit percentage increases in CPU budget compared to kernel TCP with TLS offload
• UDP-based transports justify the higher CPU cost by eliminating 50 to 200 ms of tail latency on lossy mobile networks through selective retransmission and stream isolation
• Small UDP packets (50 to 300 bytes) in gaming and IoT generate higher NIC interrupt rates and more context switches; capacity planning must account for CPU per million packets, not just per Gbps
• Mobile clients running user-space UDP transports incur battery drain from crypto and NAT keepalives every 30 to 120 seconds; keepalive tuning balances NAT binding persistence against power consumption
• Observability gaps arise when moving from kernel TCP to user-space UDP stacks; operators need replacement metrics for loss rate, RTT distribution, reordering percentage, and per-stream head-of-line blocking time (a sketch of such per-connection counters follows this list)
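As a rough illustration of those replacement metrics, a user-space transport might export per-connection counters like the following. The struct and field names are hypothetical, not taken from any particular library.

```c
/* Hypothetical per-connection metrics a user-space UDP transport might export
 * to replace kernel TCP observability (ss/netstat). Field names are illustrative. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t packets_sent;
    uint64_t packets_lost;            /* declared lost by the loss detector      */
    uint64_t packets_reordered;       /* arrived behind a higher packet number   */
    double   srtt_ms;                 /* smoothed RTT                            */
    double   rtt_p99_ms;              /* tail RTT from a sampled histogram       */
    double   hol_block_ms_per_stream; /* mean head-of-line blocking per stream   */
} conn_metrics;

static double pct(uint64_t part, uint64_t whole) {
    return whole ? 100.0 * (double)part / (double)whole : 0.0;
}

int main(void) {
    conn_metrics m = { 100000, 1800, 350, 62.0, 210.0, 4.5 };
    printf("loss %.2f%%  reorder %.2f%%  srtt %.0f ms  p99 %.0f ms  HoL %.1f ms\n",
           pct(m.packets_lost, m.packets_sent),
           pct(m.packets_reordered, m.packets_sent),
           m.srtt_ms, m.rtt_p99_ms, m.hol_block_ms_per_stream);
    return 0;
}
```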
📌 Examples
Google's early QUIC deployments on YouTube required provisioning 30 to 50% additional edge CPU compared to TCP serving, but eliminated 18 to 30% of rebuffers on mobile
High-frequency trading firms using kernel bypass and UDP can process millions of packets per second with sub-microsecond latency, but this requires specialized hardware and NIC tuning
A gaming backend sending 128 Hz updates (one packet every 7.8 ms) with 200-byte payloads generates 128 packets per second per client; at 10,000 concurrent clients that is 1.28 million packets per second, which requires careful CPU budgeting (see the sketch after these examples)
Microsoft's MsQuic library reduced CPU overhead to within 20 to 30% of kernel TCP by batching system calls and optimizing per-packet crypto operations
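A quick way to turn the gaming example above into a capacity budget is to compute the aggregate packet rate and wire bandwidth. The numbers below come from that example; the header-overhead figure (UDP + IPv4 + Ethernet) is an assumption added for illustration.

```c
/* Back-of-the-envelope capacity check for the 128 Hz gaming example above.
 * Payload, tick rate, and client count are from the example; the
 * UDP/IPv4/Ethernet header overhead is an assumption for illustration. */
#include <stdio.h>

int main(void) {
    const double tick_hz    = 128.0;        /* updates per client per second   */
    const double clients    = 10000.0;
    const double payload_b  = 200.0;        /* application payload per packet  */
    const double overhead_b = 8 + 20 + 14;  /* UDP + IPv4 + Ethernet headers   */

    double pps  = tick_hz * clients;                         /* 1.28e6 pkts/s  */
    double gbps = pps * (payload_b + overhead_b) * 8 / 1e9;  /* wire bandwidth */

    printf("interval: %.1f ms per packet per client\n", 1000.0 / tick_hz);
    printf("aggregate: %.2f Mpps, ~%.2f Gbit/s on the wire\n", pps / 1e6, gbps);
    return 0;
}
```

The aggregate bandwidth (~2.5 Gbit/s) is modest, which is exactly the point of measuring CPU per million packets rather than per Gbps: the packet rate, not the byte rate, dominates the cost.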