Search & Ranking Systems • Search Autocomplete (Trie)
Memory Compression: DAWG, FST, and Double Array Tries
A naive trie storing "sea", "seal", and "season" uses separate nodes for each character with pointer arrays for children. For a large alphabet (Unicode, emoji, mixed case), each node might allocate space for 128 or 256 child pointers, wasting memory when most are null. A 500,000 word English dictionary balloons to 500 to 900 MB. Compressed automata solve this by sharing structure and packing transitions.
Directed Acyclic Word Graphs (DAWGs) deduplicate identical suffixes. Words ending the same way ("cats" and "bats" share "ats") point to the same suffix subtree. Finite State Transducers (FSTs) go further: they map each key to a compact output payload (document IDs, scores) while sharing both common prefixes and common suffixes, and they encode transitions as delta offsets in a single contiguous array. Double array tries use two parallel arrays (base and check) to encode child transitions with O(1) lookup, achieving a three to ten times memory reduction over naive tries and dramatically better CPU cache locality.
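The double array lookup rule is compact enough to show directly: the child of state `s` on character `c` is slot `base[s] + code(c)`, valid only if `check` at that slot points back to `s`. Below is a hand-built double array for the two keys "ab" and "ac" (an illustrative layout; real builders search for non-colliding base values automatically):

```python
CODE = {"a": 1, "b": 2, "c": 3, "d": 4}  # character codes for this toy alphabet

base  = [0, 1, 2, 0, 0, 0]    # base[s] + code(ch) gives the candidate child slot
check = [-1, -1, 1, -1, 2, 2]  # check[t] == s confirms slot t is a child of s
accepting = {4, 5}             # states reached after consuming a full key

def contains(key, start_state=1):
    s = start_state
    for ch in key:
        t = base[s] + CODE[ch]  # O(1) arithmetic, no pointer chasing
        if t >= len(check) or check[t] != s:
            return False        # no such transition
        s = t
    return s in accepting

print(contains("ab"), contains("ac"), contains("ad"))  # True True False
```

Because both arrays are flat and contiguous, successive transitions touch nearby memory, which is where the cache locality advantage over pointer-based nodes comes from.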
The tradeoff is update complexity. A naive mutable trie supports incremental inserts in O(P) time for a key of length P. Compressed automata are often append only or require batch rebuilds because suffix sharing and packed arrays make in place updates expensive or impossible. Google and Amazon rebuild their autocomplete tries every five to thirty minutes from aggregated logs rather than updating per query. For use cases needing real time updates (collaborative editing, live chat), you layer a small mutable trie over a large compressed base and periodically merge.
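The layered pattern can be sketched in a few lines. Here a plain set stands in for the small mutable trie and a frozen set for the compressed base; the names and the merge trigger are illustrative, not a specific production design:

```python
class Overlay:
    """Small mutable layer that absorbs live writes."""
    def __init__(self):
        self.words = set()  # stand-in for a small mutable trie

    def insert(self, word):
        self.words.add(word)

class FrozenBase:
    """Immutable layer; stand-in for a large compressed FST."""
    def __init__(self, words):
        self.words = frozenset(words)

def complete(prefix, overlay, base, limit=5):
    # Query both layers and merge results.
    hits = {w for w in overlay.words if w.startswith(prefix)}
    hits |= {w for w in base.words if w.startswith(prefix)}
    return sorted(hits)[:limit]

def merge(overlay, base):
    # Periodic batch rebuild: fold live writes into a new immutable base.
    return FrozenBase(base.words | overlay.words), Overlay()

base = FrozenBase({"sea", "seal"})
live = Overlay()
live.insert("season")                 # real time update lands in the overlay
print(complete("sea", live, base))    # ['sea', 'seal', 'season']
base, live = merge(live, base)        # nightly/periodic merge resets the overlay
```

The key property is that reads always consult both layers, so a write is visible immediately while the expensive compressed rebuild happens on a schedule.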
Another edge case is multilingual corpora. A compressed trie optimized for English (26 letters) may not compress well for Chinese (thousands of characters) or emoji (high Unicode code points with sparse child distributions). Systems use normalization (case folding, diacritic removal) and specialized encoding (variable length codes for transition targets) to manage alphabet explosion, but memory per entry still climbs. Uber and Airbnb, serving global user bases, report segmenting tries by language or region to contain memory per shard.
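The normalization step is standard enough to sketch with Python's stdlib: case folding plus NFKD decomposition lets you drop combining marks (diacritics) before indexing, shrinking the effective alphabet. This is a generic pipeline, not any particular company's:

```python
import unicodedata

def normalize(term):
    # NFKD splits base characters from their combining marks; dropping the
    # marks removes diacritics. casefold() is locale-agnostic lowercasing.
    decomposed = unicodedata.normalize("NFKD", term.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize("Café"), normalize("SÃO PAULO"))  # cafe sao paulo
```

Note this is lossy by design: "café" and "cafe" collapse to one trie path, which is usually desirable for autocomplete recall but must happen identically at index and query time.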
💡 Key Takeaways
•Naive tries waste memory on sparse child pointer arrays. For a 256 character alphabet, each node allocates 256 pointers even if only two are used, causing 500,000 word dictionaries to consume 500 to 900 MB.
•DAWGs deduplicate identical suffixes by sharing subtrees. Words like "cats" and "bats" share the "ats" suffix nodes, reducing redundancy. FSTs encode transitions as delta offsets in contiguous arrays, improving cache locality.
•Double array tries use two parallel arrays (base and check) for O(1) child access with three to ten times memory savings over naive tries. Critical for fitting multilingual corpora (millions of entries) in RAM.
•Compressed automata trade update flexibility for memory. Most are append only or require batch rebuilds. Production systems rebuild every five to thirty minutes from logs rather than supporting per query incremental updates.
•Multilingual corpora challenge compression. Chinese (thousands of characters) and emoji (sparse high Unicode) reduce compression ratios. Systems normalize (case folding, diacritic removal) and shard by language to manage alphabet explosion.
📌 Examples
Amazon product autocomplete: Uses FST to map prefixes to product ID lists. 10 million products compressed to 2 GB (versus 15 GB naive trie), rebuilt every 15 minutes from purchase and search logs. Serves 200,000 QPS from memory with P99 latency under 10 ms.
Google search suggestions: Reported to use DAWG like suffix deduplication for its multilingual query corpus (billions of unique queries). Sharded by language and region, each shard rebuilt every 10 to 20 minutes, with top K results for hot prefixes cached in process memory.
Uber place autocomplete: Double array trie for city and address prefixes, segmented by geographic region. North America trie (5 million places) fits in 800 MB compressed versus 6 GB naive, enabling sub 5 ms P95 latency with in memory serving.
Real time chat autocomplete: Maintains small mutable trie (100,000 recent messages) overlaid on large compressed base (10 million historical messages as FST). Mutable layer handles real time updates, merged into base FST nightly to prevent unbounded growth.