Two Stage Fuzzy Search Pipeline: Fast Candidate Generation
Two Stage Architecture
Production fuzzy search uses a two stage pipeline to achieve both speed and accuracy. Stage one performs fast candidate generation, reducing a vocabulary of millions to hundreds or thousands of candidates in 1 to 5 milliseconds. Stage two computes precise edit distance on this reduced set, taking another 5 to 20 milliseconds. This separation allows using approximate but fast methods for filtering followed by exact but slower methods for verification.
N-gram Indexing
N-gram indexing is the most common candidate generation technique. An n-gram is a contiguous sequence of n characters from a string. "restaurant" with trigrams (n equals 3) produces: res, est, sta, tau, aur, ura, ran, ant. Each n-gram maps to a posting list of terms containing it. For a query, generate its n-grams and retrieve terms sharing at least k n-grams using an inverted index lookup. The threshold k is typically length minus (n times edit_distance): for a 10 character term with edit distance 2 and trigrams, require at least 10 minus (3 times 2) equals 4 matching trigrams. This filters candidates while guaranteeing no false negatives within the edit distance threshold.
Levenshtein Automata
Levenshtein automata provide an alternative that is particularly efficient for prefix matching. A Levenshtein automaton for query q and distance d is a finite state machine that accepts exactly the strings within edit distance d of q. Once constructed (takes O(q squared times d) time), the automaton can traverse a sorted term dictionary or trie in linear time, emitting all matches without computing pairwise distances. For search as you type interfaces generating suggestions from 100,000 terms, automata based traversal completes in 1 to 3 milliseconds versus 50+ milliseconds for n-gram retrieval plus verification.
BK Trees
BK trees (Burkhard Keller trees) organize terms by edit distance for efficient nearest neighbor queries. Each node stores a term; children are partitioned by their edit distance to the parent. Searching for terms within distance d of query q traverses only subtrees where the triangle inequality cannot rule out matches. For a vocabulary of 1 million terms, BK tree search examines 10,000 to 50,000 candidates (1% to 5%) rather than all 1 million, completing in 5 to 15 milliseconds.