Search & Ranking Systems • Search Autocomplete (Trie)
Memory Compression: DAWG, FST, and Double Array Tries
A naive trie storing "sea", "seal", and "season" uses separate nodes for each character with pointer arrays for children. For a large alphabet (Unicode, emoji, mixed case), each node might allocate space for 128 or 256 child pointers, wasting memory when most are null. A 500,000 word English dictionary balloons to 500 to 900 MB. Compressed automata solve this by sharing structure and packing transitions.
Directed Acyclic Word Graphs (DAWGs) deduplicate identical suffixes. Words ending the same way ("cats" and "bats" share "ats") point to the same suffix subtree. Finite State Transducers (FSTs) go further: they map each key to a compact output payload (document IDs, scores) while sharing both common prefixes and common suffixes, and they encode transitions as delta offsets in a single contiguous array. Double array tries use two parallel arrays (base and check) to encode child transitions with O(1) lookup, achieving a three to ten times memory reduction over naive tries and dramatically better CPU cache locality.
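The double array lookup rule is compact enough to show directly: the child of state `s` on character `c` is slot `base[s] + code(c)`, valid only if `check` at that slot points back to `s`. Below is a hand-built double array for the two keys "ab" and "ac" (an illustrative layout; real builders search for non-colliding base values automatically):

```python
CODE = {"a": 1, "b": 2, "c": 3, "d": 4}  # character codes for this toy alphabet

base  = [0, 1, 2, 0, 0, 0]    # base[s] + code(ch) gives the candidate child slot
check = [-1, -1, 1, -1, 2, 2]  # check[t] == s confirms slot t is a child of s
accepting = {4, 5}             # states reached after consuming a full key

def contains(key, start_state=1):
    s = start_state
    for ch in key:
        t = base[s] + CODE[ch]  # O(1) arithmetic, no pointer chasing
        if t >= len(check) or check[t] != s:
            return False        # no such transition
        s = t
    return s in accepting

print(contains("ab"), contains("ac"), contains("ad"))  # True True False
```

Because both arrays are flat and contiguous, successive transitions touch nearby memory, which is where the cache locality advantage over pointer-based nodes comes from.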
The tradeoff is update complexity. A naive mutable trie supports incremental inserts in O(P) time for a key of length P. Compressed automata are often append only or require batch rebuilds because suffix sharing and packed arrays make in place updates expensive or impossible. Google and Amazon rebuild their autocomplete tries every five to thirty minutes from aggregated logs rather than updating per query. For use cases needing real time updates (collaborative editing, live chat), you layer a small mutable trie over a large compressed base and periodically merge.
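The layered pattern can be sketched in a few lines. Here a plain set stands in for the small mutable trie and a frozen set for the compressed base; the names and the merge trigger are illustrative, not a specific production design:

```python
class Overlay:
    """Small mutable layer that absorbs live writes."""
    def __init__(self):
        self.words = set()  # stand-in for a small mutable trie

    def insert(self, word):
        self.words.add(word)

class FrozenBase:
    """Immutable layer; stand-in for a large compressed FST."""
    def __init__(self, words):
        self.words = frozenset(words)

def complete(prefix, overlay, base, limit=5):
    # Query both layers and merge results.
    hits = {w for w in overlay.words if w.startswith(prefix)}
    hits |= {w for w in base.words if w.startswith(prefix)}
    return sorted(hits)[:limit]

def merge(overlay, base):
    # Periodic batch rebuild: fold live writes into a new immutable base.
    return FrozenBase(base.words | overlay.words), Overlay()

base = FrozenBase({"sea", "seal"})
live = Overlay()
live.insert("season")                 # real time update lands in the overlay
print(complete("sea", live, base))    # ['sea', 'seal', 'season']
base, live = merge(live, base)        # nightly/periodic merge resets the overlay
```

The key property is that reads always consult both layers, so a write is visible immediately while the expensive compressed rebuild happens on a schedule.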
Another edge case is multilingual corpora. A compressed trie optimized for English (26 letters) may not compress well for Chinese (thousands of characters) or emoji (high Unicode code points with sparse child distributions). Systems use normalization (case folding, diacritic removal) and specialized encoding (variable length codes for transition targets) to manage alphabet explosion, but memory per entry still climbs. Uber and Airbnb, serving global user bases, report segmenting tries by language or region to contain memory per shard.
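The normalization step is standard enough to sketch with Python's stdlib: case folding plus NFKD decomposition lets you drop combining marks (diacritics) before indexing, shrinking the effective alphabet. This is a generic pipeline, not any particular company's:

```python
import unicodedata

def normalize(term):
    # NFKD splits base characters from their combining marks; dropping the
    # marks removes diacritics. casefold() is locale-agnostic lowercasing.
    decomposed = unicodedata.normalize("NFKD", term.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize("Café"), normalize("SÃO PAULO"))  # cafe sao paulo
```

Note this is lossy by design: "café" and "cafe" collapse to one trie path, which is usually desirable for autocomplete recall but must happen identically at index and query time.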
💡 Key Takeaways
•Naive tries waste memory on sparse child pointer arrays. For a 256 character alphabet, each node allocates 256 pointers even if only two are used, causing 500,000 word dictionaries to consume 500 to 900 MB.
•DAWGs deduplicate identical suffixes by sharing subtrees. Words like "cats" and "bats" share the "ats" suffix nodes, reducing redundancy. FSTs encode transitions as delta offsets in contiguous arrays, improving cache locality.
•Double array tries use two parallel arrays (base and check) for O(1) child access with three to ten times memory savings over naive tries. Critical for fitting multilingual corpora (millions of entries) in RAM.
•Compressed automata trade update flexibility for memory. Most are append only or require batch rebuilds. Production systems rebuild every five to thirty minutes from logs rather than supporting per query incremental updates.
•Multilingual corpora challenge compression. Chinese (thousands of characters) and emoji (sparse high Unicode) reduce compression ratios. Systems normalize (case folding, diacritic removal) and shard by language to manage alphabet explosion.
📌 Examples
Amazon product autocomplete: Uses FST to map prefixes to product ID lists. 10 million products compressed to 2 GB (versus 15 GB naive trie), rebuilt every 15 minutes from purchase and search logs. Serves 200,000 QPS from memory with P99 latency under 10 ms.
Google search suggestions: Reported to use DAWG like suffix deduplication for its multilingual query corpus (billions of unique queries). Sharded by language and region, each shard rebuilt every 10 to 20 minutes, with top K results for hot prefixes cached in process memory.
Uber place autocomplete: Double array trie for city and address prefixes, segmented by geographic region. North America trie (5 million places) fits in 800 MB compressed versus 6 GB naive, enabling sub 5 ms P95 latency with in memory serving.
Real time chat autocomplete: Maintains small mutable trie (100,000 recent messages) overlaid on large compressed base (10 million historical messages as FST). Mutable layer handles real time updates, merged into base FST nightly to prevent unbounded growth.