
Two Stage Fuzzy Search Pipeline: Fast Candidate Generation

Production fuzzy search always splits work into two stages, because checking edit distance against millions of terms is computationally prohibitive. Stage one is candidate generation: specialized data structures prune the universe of indexed terms down to a tiny shortlist (typically fifty to two hundred candidates per token) in ten to thirty milliseconds. Stage two then applies expensive scoring functions only to this shortlist. Without this separation, you cannot hit sub-100-millisecond latencies at scale.

The most common stage-one technique is character n-gram indexing, typically trigrams. Every term is decomposed into overlapping three-character sequences: the word "wireless" becomes "wir", "ire", "rel", "ele", "les", "ess". These n-grams feed an inverted index, so a misspelled query token still shares most of its trigrams with the intended term. You compute Jaccard similarity (intersection over union of the n-gram sets) and keep only terms above a threshold, say 0.3 or higher. This filters millions of terms down to dozens of plausible candidates extremely fast. The trade-off is index size: n-gram indexes typically inflate storage by three to ten times compared to exact-only indexes, and you need enough memory to keep the hot portions of the index cached, or you risk high p95 latencies from disk seeks.

Another powerful approach is the Levenshtein automaton: a finite-state machine compiled per query token that recognizes all strings within edit distance 2. You walk the term dictionary with this automaton, accepting only matching terms. This avoids index bloat but increases CPU cost per query. It works best when applied selectively to specific fields (names, titles) or combined with prefix filtering to limit the search space.
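A minimal sketch of the trigram stage described above, assuming a small in-memory dictionary. The 0.3 Jaccard threshold comes from the text; the index layout and function names are illustrative, not any particular engine's API:

```python
def trigrams(term: str) -> set[str]:
    """Overlapping 3-character sequences, e.g. 'wireless' -> {'wir', 'ire', ...}."""
    return {term[i:i + 3] for i in range(len(term) - 2)}

def build_index(dictionary):
    """Inverted index mapping each trigram to the set of terms containing it."""
    index = {}
    for term in dictionary:
        for g in trigrams(term):
            index.setdefault(g, set()).add(term)
    return index

def candidates(query: str, index, threshold: float = 0.3):
    """Pool terms sharing any trigram with the query, keep those above the Jaccard threshold."""
    q = trigrams(query)
    pool = set().union(*(index.get(g, set()) for g in q)) if q else set()
    out = []
    for term in pool:
        t = trigrams(term)
        jaccard = len(q & t) / len(q | t)  # intersection over union of n-gram sets
        if jaccard >= threshold:
            out.append((term, jaccard))
    return sorted(out, key=lambda x: -x[1])

index = build_index(["wireless", "wired", "earless", "keyboard"])
print(candidates("wireles", index))  # misspelled query still surfaces "wireless"
```

Only terms sharing at least one trigram with the query are ever scored, which is what makes the pruning cheap; a production index would store postings compressed and score them with set-intersection primitives rather than Python sets.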
For smaller dictionaries (under a few million terms), BK-trees and SymSpell delete dictionaries are alternatives. SymSpell precomputes all single- and double-character deletions of dictionary terms and looks up the query's deletions for instant candidate retrieval; it is extremely fast for edit distance 1 or 2 but requires memory proportional to dictionary size times the number of delete variants.

A concrete pattern from the field: Amazon-style retail search typically uses n-gram indexes for broad recall on product titles and descriptions, then applies Levenshtein automata or delete-dictionary lookups for brand and model-name fields where precision is critical. Google uses massive vocabularies with heavy precomputation and language-model priors to score candidate corrections. Spotify constrains fuzzy matching to name fields (artist, track, playlist) and applies it only when prefix matching yields insufficient results, keeping latency low.
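The delete-dictionary idea can be sketched for edit distance 1; real SymSpell also precomputes distance-2 deletes and verifies candidates with an exact edit-distance check, both omitted here for brevity:

```python
def deletes(term: str) -> set[str]:
    """All strings obtainable by deleting exactly one character."""
    return {term[:i] + term[i + 1:] for i in range(len(term))}

def build_delete_index(dictionary):
    """Map each term, and each of its single-character deletes, back to the term."""
    index = {}
    for term in dictionary:
        for key in {term} | deletes(term):
            index.setdefault(key, set()).add(term)
    return index

def lookup(query: str, index):
    """Candidates within edit distance 1: intersect query variants with index keys."""
    found = set()
    for key in {query} | deletes(query):
        found |= index.get(key, set())
    return found

index = build_delete_index(["spotify", "shopify", "notify"])
print(lookup("spotfy", index))  # the dropped 'i' matches a precomputed delete
```

Each lookup is a handful of hash-table probes, which is why the approach is so fast; the cost is that every dictionary term contributes one index entry per delete variant, hence the memory footprint noted above.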
💡 Key Takeaways
Character n-gram indexes (trigrams) enable Jaccard similarity filtering and prune millions of terms to dozens of candidates in under 30 milliseconds, but inflate index size by three to ten times
Levenshtein automata compiled per query token walk the dictionary accepting only terms within edit distance 2, avoiding index bloat but increasing CPU cost per query; best applied to selected fields
SymSpell delete dictionaries precompute all single and double character deletions for instant lookup, extremely fast for small to medium dictionaries but require memory proportional to dictionary size times deletions
Stage one typically outputs fifty to two hundred candidates per token; stage two then applies expensive scoring (phrase proximity, field weights, popularity, personalization) only to this shortlist
Trade-off between index size and query-time CPU: precomputed structures (n-grams, deletes) shift cost to storage and indexing; automata and dynamic checks shift cost to runtime compute
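The Levenshtein-automaton dictionary walk in the takeaways above can be sketched with the automaton's state held as a capped dynamic-programming row, stepped one character at a time so dead prefixes can be pruned early. A compiled DFA (as production engines use) is faster; this row form is an illustrative assumption that shows the mechanics:

```python
class LevenshteinAutomaton:
    """Recognizes strings within max_edits of the query, one character at a time."""

    def __init__(self, query: str, max_edits: int = 2):
        self.query = query
        self.max_edits = max_edits

    def start(self):
        return tuple(range(len(self.query) + 1))

    def step(self, state, c):
        """Advance by one character of a candidate term (values capped at max_edits + 1)."""
        cap = self.max_edits + 1
        row = [min(state[0] + 1, cap)]
        for i, qc in enumerate(self.query):
            cost = 0 if qc == c else 1
            row.append(min(row[i] + 1,        # insertion
                           state[i] + cost,   # match / substitution
                           state[i + 1] + 1,  # deletion
                           cap))
        return tuple(row)

    def accepts(self, state):
        return state[-1] <= self.max_edits

    def alive(self, state):
        """False once no extension of this prefix can match: prune the walk."""
        return min(state) <= self.max_edits

def matches(query, terms, max_edits=2):
    """Walk each term through the automaton, abandoning dead prefixes early."""
    automaton = LevenshteinAutomaton(query, max_edits)
    out = []
    for term in terms:
        state = automaton.start()
        for c in term:
            state = automaton.step(state, c)
            if not automaton.alive(state):
                break
        else:
            if automaton.accepts(state):
                out.append(term)
    return out

print(matches("wireless", ["wireles", "keyboard", "wireless"]))
```

In a real engine the terms would live in a trie or FST, so one `step` call covers every term sharing that prefix; that shared-prefix pruning, not the per-term loop shown here, is what makes the dictionary walk tractable.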
📌 Examples
A trigram index on a 5 million term product catalog might produce a 15 GB to 50 GB index (three to ten times base size) but prune candidate sets from 5M to 50 terms in 15 milliseconds using cached postings
Typesense uses in memory n gram indexes and reports 28 milliseconds average search time on a 28 million book dataset with typo tolerance, handling hundreds of concurrent queries per second