Production Deployment: Sharding, Caching, and Failure Modes
Sharding for Scale
A single autocomplete server holding 50 million terms uses 2 to 4 GB of RAM and handles 10,000 to 50,000 QPS. For 500,000 QPS, you need 10 to 50 servers. Sharding by first character distributes the load: shard A handles prefixes starting with "a", shard B handles "b", and so on. This reduces memory to roughly 100 to 150 MB per shard, easily fitting in cache. The downside: query distribution is uneven. In English, "s" prefixes are about 3x more common than "z" prefixes, so the "s" shard receives 3x the traffic. Hash-based sharding spreads load evenly but requires scatter-gather for queries.
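A minimal sketch of the routing logic, assuming one shard per letter; the shard count, function names, and the choice to fall back to hashing for non-alphabetic prefixes are illustrative:

```python
import hashlib

NUM_SHARDS = 26

def hash_shard(prefix: str) -> int:
    """Hash-based sharding: even load, but a query may need scatter-gather."""
    digest = hashlib.md5(prefix.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def shard_by_first_char(prefix: str) -> int:
    """Route by first character: 'apple' -> shard 0, 'banana' -> shard 1."""
    c = prefix[0].lower()
    if "a" <= c <= "z":
        return ord(c) - ord("a")
    return hash_shard(prefix)  # digits/symbols fall back to hashing
```

First-character routing keeps each query on a single shard; the hash fallback only matters for the small minority of non-alphabetic prefixes.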
Multi-Tier Caching
Autocomplete traffic follows an extreme power law: the top 1 percent of prefixes ("a", "th", "ca") account for 50 percent of queries. Cache aggressively. Tier 1: an in-process LRU cache (10 to 50 MB, sub-microsecond access) holds the hottest 10,000 prefix results. Tier 2: a Redis or Memcached cluster (1 to 10 GB, 0.1 to 0.5 ms access) holds the top 1 million. Tier 3: the trie itself (2 to 4 GB, 0.5 to 2 ms access). With proper tiering, 95 percent of queries hit Tier 1 or 2, achieving p50 latency under 1 ms.
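The tiered lookup can be sketched as follows; `redis_get` and `trie_lookup` are hypothetical hooks standing in for the Tier 2 cluster and the Tier 3 trie, and only the in-process LRU tier is implemented here:

```python
from collections import OrderedDict

class TieredAutocomplete:
    def __init__(self, trie_lookup, redis_get=None, l1_capacity=10_000):
        self.l1 = OrderedDict()          # Tier 1: in-process LRU cache
        self.l1_capacity = l1_capacity
        self.redis_get = redis_get       # Tier 2: Redis/Memcached (optional)
        self.trie_lookup = trie_lookup   # Tier 3: the trie itself

    def suggest(self, prefix):
        if prefix in self.l1:            # Tier 1 hit: sub-microsecond
            self.l1.move_to_end(prefix)
            return self.l1[prefix]
        results = self.redis_get(prefix) if self.redis_get else None
        if results is None:              # Tier 2 miss: full trie lookup
            results = self.trie_lookup(prefix)
        self.l1[prefix] = results        # promote into Tier 1
        if len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)  # evict least recently used
        return results
```

Each lookup promotes its result into Tier 1, so under the power-law distribution the hottest prefixes quickly stop reaching the trie at all.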
Atomic Updates
Most dictionaries update daily (new trending terms, frequency refreshes). The trie is immutable during serving. Updates happen by building a new trie in the background (5 to 10 minutes for 50 million terms), then atomically swapping the pointer. During the swap, in-flight queries complete against the old trie; new queries use the new one. No downtime, no inconsistency. Use versioned snapshots and switch all shards atomically after the rebuild finishes.
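A sketch of the pointer swap, assuming the background rebuild produces a new immutable trie object. In CPython, rebinding an attribute is atomic, so a reader sees either the old or the new snapshot, never a mix:

```python
class TrieServer:
    def __init__(self, trie):
        self._trie = trie                # current immutable snapshot

    def query(self, prefix):
        trie = self._trie                # grab the snapshot once per query
        return trie.lookup(prefix)       # in-flight queries keep the old trie

    def swap(self, new_trie):
        self._trie = new_trie            # atomic rebind; no locks needed
```

The key detail is that `query` reads the pointer exactly once; a query that started before the swap finishes against the snapshot it started with.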
Failure Modes and Graceful Degradation
Unlike search, autocomplete is non-critical: users can always type the full query and press enter. Fail open: if the service is down, hide the suggestion dropdown rather than showing errors. Monitor p99 latency: if suggestions take longer than 100 ms, users perceive lag and ignore them. Watch for the cold-start thundering herd: after a restart, empty caches send every query to the trie. Warm the caches by replaying recent query logs before adding a server to the load balancer.
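The fail-open behavior amounts to a few lines on the client side; `fetch_suggestions` here is a hypothetical network call, and the 100 ms timeout mirrors the p99 threshold above:

```python
def suggest_or_nothing(fetch_suggestions, prefix, timeout_ms=100):
    """Fail open: any error or timeout yields an empty list,
    so the UI simply hides the dropdown instead of showing an error."""
    try:
        return fetch_suggestions(prefix, timeout=timeout_ms / 1000)
    except Exception:
        return []  # no suggestions, no error surfaced to the user
```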
Latency Budget
The total budget is 50 ms from keystroke to suggestion. Network round trip: 5 to 20 ms depending on user distance. Server processing: 0.1 to 2 ms (cache hit) or 2 to 5 ms (trie lookup). Response serialization: 0.1 to 0.5 ms. The network dominates, which is why major services deploy autocomplete at edge locations, cutting round trips to under 10 ms for most users. Geographic sharding with regional clusters achieves sub-50 ms p95 worldwide.
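A back-of-the-envelope check of the budget, taking the worst-case figure for each component above:

```python
network_rtt = 20.0       # ms, distant user
server_trie_miss = 5.0   # ms, cache miss, full trie lookup
serialization = 0.5      # ms, response encoding
total = network_rtt + server_trie_miss + serialization
# worst case is 25.5 ms, leaving roughly half the 50 ms budget as headroom
```

Even the worst case fits comfortably, which is why shaving the network term via edge deployment yields the biggest win.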