DNS Performance Budgets and Latency Optimization

Production systems typically aim to keep total DNS resolution time under 50 ms at the 50th percentile and under 150 ms at the 95th percentile for globally distributed consumer applications. Hitting these targets requires understanding where latency comes from: cache hits at recursive resolvers complete in single-digit to tens of milliseconds, while cache misses that require a full hierarchy traversal add 50 to 300 ms depending on geographic distance and network congestion. On a cold cache, each CNAME hop in a resolution chain adds one more round trip, so chains longer than 2 CNAMEs become a significant contributor to tail latency.
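The budget arithmetic above can be sketched as a small expected-latency model. This is only an illustration: the function name, the 90 percent hit rate, and the per-hop RTT are assumed values chosen from within the ranges quoted above, not measurements.

```python
def expected_resolution_ms(hit_rate, hit_ms, miss_ms, cname_hops=0, hop_rtt_ms=0.0):
    """Average resolution latency for a name, in milliseconds.

    Assumption: CNAME hops only cost extra round trips on the miss path,
    because a warm resolver answers the whole chain from cache.
    """
    miss_total_ms = miss_ms + cname_hops * hop_rtt_ms
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_total_ms


if __name__ == "__main__":
    # Hypothetical inputs: 90% hit rate, 10 ms hits, 150 ms misses, 40 ms per hop.
    direct = expected_resolution_ms(0.90, 10, 150)
    chained = expected_resolution_ms(0.90, 10, 150, cname_hops=2, hop_rtt_ms=40)
    print(f"direct A record: {direct:.1f} ms, two-CNAME chain: {chained:.1f} ms")
```

The average hides the tail: in this model the miss path itself grows from 150 ms to 230 ms with two hops, which is what shows up at p95 and p99.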

Modern resolver implementations use several techniques to control latency. Request coalescing prevents thundering-herd problems when TTLs expire by combining concurrent identical queries into a single upstream request. Prefetching hot names before TTL expiry (when hit count exceeds a threshold) eliminates cache-miss latency for popular domains. Serve-stale mechanisms let resolvers return expired cached answers while revalidating in the background, protecting p95 and p99 latency during upstream slowness. Cloudflare's resolver achieves sub-100 ms global p50 latency partly through heavy use of persistent transport (DNS over TLS and DNS over HTTPS with connection reuse), which amortizes TLS handshake costs.
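A minimal sketch of request coalescing combined with serve-stale, using asyncio. The class name `CoalescingCache`, the `fetch` callable, and the 30 second stale window are hypothetical choices for illustration; real resolvers implement these ideas inside their own cache layers.

```python
import asyncio
import time


class CoalescingCache:
    """Toy resolver cache: one upstream fetch per name, plus serve-stale."""

    def __init__(self, fetch, stale_window=30.0, clock=time.monotonic):
        self._fetch = fetch              # async callable: name -> (answer, ttl_seconds)
        self._stale_window = stale_window
        self._clock = clock
        self._entries = {}               # name -> (answer, expires_at)
        self._inflight = {}              # name -> asyncio.Task for the upstream fetch

    def _refresh(self, name):
        # Coalescing: reuse the in-flight task if one already exists for this name.
        task = self._inflight.get(name)
        if task is None:
            task = asyncio.create_task(self._load(name))
            self._inflight[name] = task
        return task

    async def _load(self, name):
        try:
            answer, ttl = await self._fetch(name)
            self._entries[name] = (answer, self._clock() + ttl)
            return answer
        finally:
            self._inflight.pop(name, None)

    async def resolve(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None:
            answer, expires_at = entry
            if now < expires_at:
                return answer                     # fresh cache hit
            if now < expires_at + self._stale_window:
                self._refresh(name)               # revalidate in the background
                return answer                     # serve the stale answer now
        return await self._refresh(name)          # miss: wait on the shared fetch


async def _demo():
    upstream_calls = 0

    async def fake_upstream(name):
        nonlocal upstream_calls
        upstream_calls += 1
        await asyncio.sleep(0.01)                 # simulated upstream latency
        return "192.0.2.1", 60.0

    cache = CoalescingCache(fake_upstream)
    answers = await asyncio.gather(*(cache.resolve("example.com") for _ in range(100)))
    print(len(answers), "answers,", upstream_calls, "upstream fetches")


if __name__ == "__main__":
    asyncio.run(_demo())
```

The sketch omits error handling: a failed background refresh is simply dropped, and a production cache would also bound its size and tie the stale window to health signals.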

For authoritative operators, minimizing response size is critical. Large responses caused by DNSSEC signatures, many A/AAAA records, or long TXT records can exceed the path MTU (typically 1280 to 1500 bytes), causing IP fragmentation, and fragments are often dropped by middleboxes. Responses that exceed the client's advertised UDP buffer size are truncated and force TCP fallback, which adds at least one full RTT for the TCP handshake on top of the repeated query and puts significantly more CPU load on both client and server. Production best practice keeps responses under 1232 bytes to avoid fragmentation and uses multiple sharded names rather than a single huge RRset when many records are needed.
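The 1232 byte guideline can be checked with a rough size estimate before publishing an RRset. The sketch below is an approximation under stated assumptions: the owner name is compressed to a 2 byte pointer in every answer, an EDNS OPT record with no options is present, and no DNSSEC signatures are included. The function names are hypothetical.

```python
SAFE_UDP_PAYLOAD = 1232   # byte limit recommended in the text above
HEADER_BYTES = 12         # fixed DNS header
OPT_RR_BYTES = 11         # EDNS OPT pseudo-record without options
RR_FIXED_BYTES = 2 + 10   # compressed owner pointer + type, class, TTL, rdlength


def estimate_response_size(qname, rdata_lengths):
    """Approximate wire size of a response carrying one RRset (non-root qname)."""
    qname_wire = len(qname.rstrip(".")) + 2      # length bytes plus root label
    question = qname_wire + 4                    # plus qtype and qclass
    answers = sum(RR_FIXED_BYTES + n for n in rdata_lengths)
    return HEADER_BYTES + question + answers + OPT_RR_BYTES


def shard_records(qname, rdata_lengths, limit=SAFE_UDP_PAYLOAD):
    """Greedily split records into groups whose estimated size fits the limit."""
    shards, current = [], []
    for n in rdata_lengths:
        if current and estimate_response_size(qname, current + [n]) > limit:
            shards.append(current)
            current = []
        current.append(n)
    if current:
        shards.append(current)
    return shards


if __name__ == "__main__":
    aaaa = [16] * 100                            # 100 AAAA records, 16 bytes of rdata each
    print("single RRset:", estimate_response_size("example.com", aaaa), "bytes")
    print("shard sizes:", [len(s) for s in shard_records("example.com", aaaa)])
```

In practice each shard would live under its own, slightly longer name, so the real per-shard size is a few bytes larger than this estimate; leaving headroom below the limit covers that.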

💡 Key Takeaways

✓ Target DNS performance budgets of under 50 ms p50 and under 150 ms p95 for global consumer apps; each CNAME hop adds one RTT, potentially increasing tail latency by 20 to 100 ms

✓ Cache miss paths add 50 to 300 ms but affect only 5 to 20 percent of queries at scale, given 80 to 95 percent hit rates; prefetching hot names before TTL expiry eliminates miss latency for popular domains

✓ Serve-stale-while-revalidate protects tail latency by returning expired cache entries during upstream slowness, but can prolong exposure to dead endpoints if not integrated with health signals

✓ Responses exceeding path MTU (1280 to 1500 bytes) cause fragmentation, and fragments are often dropped by middleboxes; truncation that forces a TCP retry adds at least one full RTT and 3 to 10x server CPU cost

✓ DNSSEC adds 1 to 3 ms of validation CPU per cold query and increases response size by hundreds of bytes, so its security benefits must be weighed against query performance

✓ Request coalescing at resolvers combines concurrent identical queries during TTL expiry, preventing query storms; without it, a popular domain expiring can generate thousands of simultaneous upstream queries

📌 Interview Tips

1. A two-CNAME chain (example.com to cdn.provider.com to edge.server.com) requires three separate DNS queries on a cold cache, potentially adding 60 to 200 ms of total latency compared to a direct A record response

2. Google Public DNS prefetches high-traffic names when hit count exceeds thresholds and TTL drops below 10 percent of its original value, keeping resolution time under 20 ms even as TTLs expire

3. A DNSSEC-signed response with multiple RRSIG records totaling 1800 bytes exceeds a 1500 byte MTU; rather than risk fragmentation, the server truncates it, forcing a TCP retry that increases median latency from 15 ms to 45 ms (additional TCP handshake plus repeated query)
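Two of the tips above reduce to small calculations, sketched here under stated assumptions. The `min_hits` threshold is a made-up value (the text only says "thresholds"), and the fallback model assumes a constant RTT with negligible server processing time.

```python
def should_prefetch(hit_count, remaining_ttl, original_ttl, min_hits=10, ttl_fraction=0.10):
    """Refresh a cached name early if it is popular and close to expiry.

    min_hits is a hypothetical popularity threshold; ttl_fraction follows the
    "below 10 percent of original TTL" rule quoted in the tip above.
    """
    if original_ttl <= 0:
        return False
    return hit_count >= min_hits and remaining_ttl < ttl_fraction * original_ttl


def tcp_fallback_latency_ms(rtt_ms):
    """Latency when a UDP answer comes back truncated and the client retries over TCP.

    One RTT for the truncated UDP exchange, one for the TCP handshake,
    and one for the repeated query over TCP.
    """
    return 3 * rtt_ms


if __name__ == "__main__":
    print(should_prefetch(hit_count=50, remaining_ttl=20, original_ttl=300))
    print(tcp_fallback_latency_ms(15), "ms with a 15 ms RTT")
```

With a 15 ms RTT the fallback model reproduces the 15 ms to 45 ms jump described in the third tip.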
