
Production Deployment: Sharding, Caching, and Failure Modes

A single trie instance can handle tens of thousands of queries per second for cache-resident hot prefixes, but real autocomplete systems face uneven load, memory constraints, and update-coordination challenges. Production deployments shard tries by prefix hash or by the first N characters to distribute load, replicate hot shards (prefixes starting with "s" or "a" in English see 10 to 20 times more traffic), and use multi-tier caching to absorb traffic spikes without hitting storage.

Sharding strategy determines where hotspots land. Hashing the full prefix spreads load evenly but requires scatter-gather for fuzzy queries (edit-distance candidates may hash to different shards). Sharding by the first one or two characters preserves prefix locality but creates hotspots on common letters. Hybrid approaches use consistent hashing with virtual nodes, co-locate shards with frontends to minimize network round-trip time (RTT), and replicate the top 1,000 hottest prefixes to dedicated high-capacity shards. Google and Amazon reportedly use geographic sharding for location autocomplete, routing queries to regional clusters to reduce latency and incorporate local popularity.

Caching is layered. L1 is in-process memory holding a compressed trie plus top-K lists for the hottest prefixes (targeting a greater than 99 percent hit rate on prefixes of length two or more). L2 is a cross-node cache (Redis, Memcached) for cold prefixes, avoiding repeated deep trie traversals. Cache keys are the normalized prefix (lowercased, diacritics folded) plus a version tag; values are top-K lists. TTL aligns with the snapshot rebuild cadence (five to thirty minutes). Warmup is critical: before serving traffic on a new instance, preload L1 from the past hour's traffic logs to avoid cold-start latency spikes.

Failure modes cluster around consistency and abuse. Distributed snapshot updates can temporarily leave shards serving different versions, so the same prefix returns different suggestions depending on which shard answers. Use versioned snapshots and monotonic readers: serve only from the latest completed snapshot per shard, and switch all shards atomically when the rebuild finishes. Abuse is another vector: attackers can spam the system with generated queries to pollute suggestions with malicious content. Apply per-client rate limits, blocklists for known bad patterns, and k-anonymity thresholds at both index-build time and serve time. Human-in-the-loop review for high-impact prefixes (the top 10,000 by traffic) catches edge cases that automated filters miss.
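As a concrete illustration of the hybrid sharding approach above, here is a minimal sketch of a router that places shards on a consistent-hash ring with virtual nodes and diverts a small set of known-hot prefixes to dedicated replicas. The class and shard names, the MD5 hash choice, and the thresholds are assumptions for illustration, not a description of any specific production system.

```python
import bisect
import hashlib

class ShardRouter:
    """Sketch: consistent hashing with virtual nodes plus hot-prefix replication."""

    def __init__(self, shard_ids, virtual_nodes=200, hot_prefixes=(), hot_shards=()):
        # Build the ring: each physical shard contributes many virtual nodes,
        # which smooths out load imbalance from unlucky hash placement.
        ring = []
        for shard in shard_ids:
            for v in range(virtual_nodes):
                ring.append((self._hash(f"{shard}#{v}"), shard))
        ring.sort()
        self._ring = ring
        self._ring_hashes = [h for h, _ in ring]
        self._hot_prefixes = set(hot_prefixes)   # e.g. the top 1,000 by traffic
        self._hot_shards = list(hot_shards)      # dedicated high-capacity replicas

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    @staticmethod
    def normalize(prefix: str) -> str:
        # Lowercase before routing; a real system would also fold diacritics.
        return prefix.strip().lower()

    def pick_shard(self, prefix: str) -> str:
        p = self.normalize(prefix)
        if p in self._hot_prefixes and self._hot_shards:
            # Hot prefixes bypass the ring and spread across dedicated replicas.
            return self._hot_shards[self._hash(p) % len(self._hot_shards)]
        # Walk clockwise on the ring to the first virtual node >= hash(prefix).
        idx = bisect.bisect_left(self._ring_hashes, self._hash(p)) % len(self._ring)
        return self._ring[idx][1]

router = ShardRouter(
    shard_ids=[f"shard-{i}" for i in range(50)],
    hot_prefixes={"s", "a", "ipho"},
    hot_shards=["hot-1", "hot-2", "hot-3"],
)
print(router.pick_shard("iPho"))   # served by a dedicated hot replica
print(router.pick_shard("zebra"))  # routed via the consistent-hash ring
```

Routing by the normalized prefix keeps all traffic for one prefix on one shard (so its top-K list stays local), while the virtual nodes keep rebalancing cheap when shards are added or removed.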
💡 Key Takeaways
Sharding by prefix hash spreads load evenly but requires scatter-gather for fuzzy queries. Sharding by the first one or two characters preserves locality but creates hotspots on common letters ("s" and "a" see 10 to 20 times more traffic in English).
Multi-tier caching absorbs traffic spikes: L1 in-process (greater than 99 percent hit rate for hot prefixes, under 1 ms latency), L2 cross-node cache (cold prefixes, adds 2 to 5 ms), fallback to deep trie traversal (adds 10 to 50 ms). Warm L1 from traffic logs before serving to avoid cold starts; a lookup sketch follows this list.
Distributed snapshot inconsistency causes the same prefix to return different suggestions across shards. Use versioned snapshots and monotonic readers: serve only from the latest completed snapshot per shard, and switch all shards atomically after a rebuild finishes.
Abuse and spam pollute suggestions with malicious content. Apply per-client rate limits, blocklists, and k-anonymity thresholds at both index-build and serve time. Human review of the top 10,000 prefixes by traffic catches edge cases that automated filters miss.
Geographic sharding reduces latency and incorporates local popularity: route queries to regional clusters and replicate hot global prefixes across regions. Google and Amazon reportedly use this for location autocomplete to serve sub-50 ms P95 latency worldwide.
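As referenced in the caching takeaway above, here is a rough sketch of the layered lookup path under the key scheme described in this section (normalized prefix plus snapshot version tag, TTL tied to the rebuild cadence). The `l2_client` and `trie_topk` callables are hypothetical stand-ins for a cross-node cache client and the shard-local trie traversal; their interfaces are assumptions for illustration.

```python
import unicodedata

SNAPSHOT_VERSION = "v1234"   # swapped atomically on all nodes when a rebuild completes
L2_TTL_SECONDS = 15 * 60     # keep TTL aligned with the snapshot rebuild cadence

l1_cache = {}                # in-process memory: hottest prefixes, sub-millisecond

def normalize(prefix: str) -> str:
    # Lowercase and fold diacritics so "Café" and "cafe" share one cache entry.
    folded = unicodedata.normalize("NFKD", prefix.strip().lower())
    return "".join(c for c in folded if not unicodedata.combining(c))

def cache_key(prefix: str) -> str:
    return f"{SNAPSHOT_VERSION}:{normalize(prefix)}"

def suggest(prefix: str, l2_client, trie_topk, k: int = 10):
    key = cache_key(prefix)

    # L1: in-process (target >99% hit rate on prefixes of length two or more).
    hit = l1_cache.get(key)
    if hit is not None:
        return hit

    # L2: cross-node cache such as Redis or Memcached (a few extra milliseconds).
    cached = l2_client.get(key)
    if cached is not None:
        l1_cache[key] = cached
        return cached

    # Miss everywhere: fall back to a deep trie traversal on the owning shard.
    results = trie_topk(normalize(prefix), k)
    l2_client.set(key, results, ttl=L2_TTL_SECONDS)
    l1_cache[key] = results
    return results
```

Because the version tag is part of the key, switching `SNAPSHOT_VERSION` invalidates both tiers at once; on a new instance, the same `l1_cache` would be pre-populated from recent traffic logs before the node joins the serving pool.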
📌 Examples
Amazon product autocomplete: 50 shards by prefix hash (first three characters), each shard serving 4,000 QPS at peak. Hot prefixes ("ipho", "play", "airp") replicated to three dedicated high-capacity shards. L1 cache hit rate 99.2 percent, P99 latency 8 ms.
Google search: Geographic sharding with regional clusters in US West, US East, EU, Asia. Queries routed to nearest cluster (30 to 50 ms RTT savings). Each region maintains local popularity ranking ("warriors" ranks high in San Francisco, low in New York), falls back to global ranking for rare prefixes.
Uber place autocomplete: Consistent hashing with 200 virtual nodes per physical shard. Top 1,000 hottest city prefixes ("san", "new", "los") replicated to all shards. Shard rebalancing during traffic spikes (New Year's Eve) completes in under five minutes without dropping requests.
Airbnb listing search: Snapshot rebuild every 15 minutes. Version tags (v1234) appended to cache keys and shard metadata. Atomic switch: all shards finish rebuilding v1234, the coordinator broadcasts a switch signal, and shards begin serving v1234 within two seconds. Prevents users from seeing mixed results from v1233 and v1234.
Spam attack on a small e-commerce autocomplete: An attacker floods 10,000 queries for "buy fake reviews". The k-anonymity threshold (50 distinct users) blocks the suggestion from appearing. A rate limiter caps the attacker at 100 queries per minute per IP. Human review flags the pattern and adds "fake reviews" to the blocklist within one hour.
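A minimal sketch of the two defenses in the spam example above, assuming a query log of (query, user_id) pairs at index-build time and client IDs (for example, IP addresses) at serve time. The threshold and limit values mirror the numbers above, but the function and class names are illustrative.

```python
import time
from collections import defaultdict, deque

K_ANONYMITY_THRESHOLD = 50   # a completion must come from >= 50 distinct users
RATE_LIMIT = 100             # max queries per client per window
WINDOW_SECONDS = 60

def build_suggestion_set(query_log, blocklist):
    """Index-build-time filter: drop low-k-anonymity and blocklisted queries."""
    distinct_users = defaultdict(set)
    for query, user_id in query_log:   # query_log: iterable of (query, user_id)
        distinct_users[query.strip().lower()].add(user_id)
    return {
        q for q, users in distinct_users.items()
        if len(users) >= K_ANONYMITY_THRESHOLD
        and not any(term in q for term in blocklist)
    }

class SlidingWindowLimiter:
    """Serve-time rate limit keyed by client (for example, per IP)."""

    def __init__(self):
        self._events = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        window = self._events[client_id]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()             # drop events outside the window
        if len(window) >= RATE_LIMIT:
            return False                 # over the limit: reject or degrade
        window.append(now)
        return True
```

Applying the k-anonymity filter at build time keeps polluted queries out of the index entirely, while the serve-time limiter blunts flooding before it can skew the logs used for the next rebuild.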