
Two Stage Fuzzy Search Pipeline: Fast Candidate Generation

Production fuzzy search always splits work into two stages, because checking edit distance against millions of terms is computationally prohibitive. Stage one is candidate generation: specialized data structures prune the universe of indexed terms down to a tiny shortlist (typically fifty to two hundred candidates per token) in ten to thirty milliseconds. Stage two then applies expensive scoring functions only to this shortlist. Without this separation, you cannot hit sub-100-millisecond latencies at scale.

The most common stage-one technique is character n-gram indexing, typically trigrams. Every term is decomposed into overlapping three-character sequences: the word "wireless" becomes "wir", "ire", "rel", "ele", "les", "ess". These n-grams feed an inverted index, so a misspelled query token still shares most of its trigrams with the intended term. You compute Jaccard similarity (intersection over union of the n-gram sets) and keep only terms above a threshold, say 0.3 or higher. This filters millions of terms down to dozens of plausible candidates extremely fast. The trade-off is index size: n-gram indexes typically inflate storage by three to ten times compared to exact-only indexes, and you need enough memory to keep the hot portions of the index cached, or you risk high p95 latencies from disk seeks.

Another powerful approach is the Levenshtein automaton: a finite-state machine compiled per query token that recognizes all strings within edit distance 2. You walk the term dictionary with this automaton, accepting only matching terms. This avoids index bloat but increases CPU cost per query. It works best when applied selectively to specific fields (names, titles) or combined with prefix filtering to limit the search space.
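A minimal sketch of the trigram stage described above, assuming a small in-memory dictionary. The 0.3 Jaccard threshold comes from the text; the index layout and function names are illustrative, not any particular engine's API:

```python
def trigrams(term: str) -> set[str]:
    """Overlapping 3-character sequences, e.g. 'wireless' -> {'wir', 'ire', ...}."""
    return {term[i:i + 3] for i in range(len(term) - 2)}

def build_index(dictionary):
    """Inverted index mapping each trigram to the set of terms containing it."""
    index = {}
    for term in dictionary:
        for g in trigrams(term):
            index.setdefault(g, set()).add(term)
    return index

def candidates(query: str, index, threshold: float = 0.3):
    """Pool terms sharing any trigram with the query, keep those above the Jaccard threshold."""
    q = trigrams(query)
    pool = set().union(*(index.get(g, set()) for g in q)) if q else set()
    out = []
    for term in pool:
        t = trigrams(term)
        jaccard = len(q & t) / len(q | t)  # intersection over union of n-gram sets
        if jaccard >= threshold:
            out.append((term, jaccard))
    return sorted(out, key=lambda x: -x[1])

index = build_index(["wireless", "wired", "earless", "keyboard"])
print(candidates("wireles", index))  # misspelled query still surfaces "wireless"
```

Only terms sharing at least one trigram with the query are ever scored, which is what makes the pruning cheap; a production index would store postings compressed and score them with set-intersection primitives rather than Python sets.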
For smaller dictionaries (under a few million terms), BK-trees and SymSpell delete dictionaries are alternatives. SymSpell precomputes all single- and double-character deletions of dictionary terms and looks up the query's deletions for instant candidate retrieval; it is extremely fast for edit distance 1 or 2 but requires memory proportional to dictionary size times the number of delete variants.

A concrete pattern from the field: Amazon-style retail search typically uses n-gram indexes for broad recall on product titles and descriptions, then applies Levenshtein automata or delete-dictionary lookups for brand and model-name fields where precision is critical. Google uses massive vocabularies with heavy precomputation and language-model priors to score candidate corrections. Spotify constrains fuzzy matching to name fields (artist, track, playlist) and applies it only when prefix matching yields insufficient results, keeping latency low.
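The delete-dictionary idea can be sketched for edit distance 1; real SymSpell also precomputes distance-2 deletes and verifies candidates with an exact edit-distance check, both omitted here for brevity:

```python
def deletes(term: str) -> set[str]:
    """All strings obtainable by deleting exactly one character."""
    return {term[:i] + term[i + 1:] for i in range(len(term))}

def build_delete_index(dictionary):
    """Map each term, and each of its single-character deletes, back to the term."""
    index = {}
    for term in dictionary:
        for key in {term} | deletes(term):
            index.setdefault(key, set()).add(term)
    return index

def lookup(query: str, index):
    """Candidates within edit distance 1: intersect query variants with index keys."""
    found = set()
    for key in {query} | deletes(query):
        found |= index.get(key, set())
    return found

index = build_delete_index(["spotify", "shopify", "notify"])
print(lookup("spotfy", index))  # the dropped 'i' matches a precomputed delete
```

Each lookup is a handful of hash-table probes, which is why the approach is so fast; the cost is that every dictionary term contributes one index entry per delete variant, hence the memory footprint noted above.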
💡 Key Takeaways
Character n-gram indexes (trigrams) enable Jaccard similarity filtering and prune millions of terms to dozens of candidates in under 30 milliseconds, but inflate index size by three to ten times
Levenshtein automata compiled per query token walk the dictionary accepting only terms within edit distance 2, avoiding index bloat but increasing CPU cost per query; best applied to selected fields
SymSpell delete dictionaries precompute all single and double character deletions for instant lookup, extremely fast for small to medium dictionaries but require memory proportional to dictionary size times deletions
Stage one typically outputs fifty to two hundred candidates per token; stage two then applies expensive scoring (phrase proximity, field weights, popularity, personalization) only to this shortlist
Trade-off between index size and query-time CPU: precomputed structures (n-grams, deletes) shift cost to storage and indexing; automata and dynamic checks shift cost to runtime compute
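The Levenshtein-automaton dictionary walk in the takeaways above can be sketched with the automaton's state held as a capped dynamic-programming row, stepped one character at a time so dead prefixes can be pruned early. A compiled DFA (as production engines use) is faster; this row form is an illustrative assumption that shows the mechanics:

```python
class LevenshteinAutomaton:
    """Recognizes strings within max_edits of the query, one character at a time."""

    def __init__(self, query: str, max_edits: int = 2):
        self.query = query
        self.max_edits = max_edits

    def start(self):
        return tuple(range(len(self.query) + 1))

    def step(self, state, c):
        """Advance by one character of a candidate term (values capped at max_edits + 1)."""
        cap = self.max_edits + 1
        row = [min(state[0] + 1, cap)]
        for i, qc in enumerate(self.query):
            cost = 0 if qc == c else 1
            row.append(min(row[i] + 1,        # insertion
                           state[i] + cost,   # match / substitution
                           state[i + 1] + 1,  # deletion
                           cap))
        return tuple(row)

    def accepts(self, state):
        return state[-1] <= self.max_edits

    def alive(self, state):
        """False once no extension of this prefix can match: prune the walk."""
        return min(state) <= self.max_edits

def matches(query, terms, max_edits=2):
    """Walk each term through the automaton, abandoning dead prefixes early."""
    automaton = LevenshteinAutomaton(query, max_edits)
    out = []
    for term in terms:
        state = automaton.start()
        for c in term:
            state = automaton.step(state, c)
            if not automaton.alive(state):
                break
        else:
            if automaton.accepts(state):
                out.append(term)
    return out

print(matches("wireless", ["wireles", "keyboard", "wireless"]))
```

In a real engine the terms would live in a trie or FST, so one `step` call covers every term sharing that prefix; that shared-prefix pruning, not the per-term loop shown here, is what makes the dictionary walk tractable.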
📌 Examples
A trigram index on a 5 million term product catalog might produce a 15 GB to 50 GB index (three to ten times base size) but prune candidate sets from 5M to 50 terms in 15 milliseconds using cached postings
Typesense uses in memory n gram indexes and reports 28 milliseconds average search time on a 28 million book dataset with typo tolerance, handling hundreds of concurrent queries per second