Failure Modes and Edge Cases in Fuzzy Search
Short Token Explosion
Short tokens are the most common fuzzy search failure. Allowing edit distance 1 on tokens like "tv", "ps", or "cd" matches an enormous vocabulary fraction. A query for "ps" with edit distance 1 matches "as", "is", "us", "vs", "psi", "spa", "spy", and dozens more. English has roughly 100 two letter and 1000 three letter valid words; with edit distance 1, you match 10 to 30 percent of these. The mitigation is strict: disable fuzzy entirely (edit distance 0) for tokens under 4 characters, or only allow prefix matching. The rule: token length under 4 uses exact match; length 4 to 6 allows edit distance 1; over 6 allows edit distance 2.
Entity and Brand Name Overcorrection
Proper nouns and brand names create high impact failures. If a user types "nikee", naive fuzzy search might autocorrect to "niche" or "nice" instead of "nike". Searches for "adiddas" should find "adidas", not "additions". The fix requires a high quality entity dictionary containing brands, products, names, and locations. These entities receive elevated priors (2x to 5x score boost) in the scoring function. When a query token is within edit distance 1 of a known entity, strongly prefer that entity over generic words. When correction confidence is low (multiple equally good matches), show "did you mean" suggestions rather than auto applying corrections blindly.
Multi Token Cross Contamination
Multi token queries introduce cross token misalignment. Correcting tokens independently produces semantically inconsistent results. The query "wirless mouse logitec" might correct to "wireless mouse logistics" because logistics is close to logitec in isolation, but semantically wrong in electronics context. Language models scoring joint likelihood of corrected queries help but add 10 to 50ms latency. A simpler production heuristic: allow at most 2 fuzzy tokens in a 5 token query, and always boost queries where all tokens match exactly or with total edit distance under 3. This prevents drifting too far from user intent.
Operational Failure Modes
Index growth under fuzzy search is an operational concern. N gram indexes grow 3x to 10x larger than exact indexes because each term generates multiple overlapping n grams. If index size exceeds available memory and forces disk access, p95 latencies balloon from 30ms to 300 plus ms. Monitor heap usage and page cache hit rates. Shard hot fields into separate in memory indexes, or use tiered storage with hot entities in RAM and long tail on SSD. Adversarial queries (very long strings, repeated characters like "aaa" times 100) trigger worst case automata behavior. Apply query shaping: cap query length to 200 chars, limit to 20 tokens, minimum token length 2, and terminate candidate expansion after 200 candidates per token.