Text Proximity: Positional Indexes and Slop Evaluation
Text Proximity Defined
Text proximity search finds documents where specified words appear near each other. The phrase query "machine learning" requires those words to be adjacent and in order. The proximity query "machine NEAR learning" allows the words to appear within some maximum distance, called slop.
Why it matters: "the machine was learning" matches a proximity query with slop 2 but not a phrase query. Proximity search tolerates word-order variation and intervening words, giving a flexibility that phrase search cannot.
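The slop arithmetic for that example can be checked directly. This is a minimal sketch, assuming 0-based token positions and slop measured as the absolute difference between positions:

```python
# Tokenize the example sentence; positions are 0-based word offsets.
tokens = "the machine was learning".split()
pos_machine = tokens.index("machine")    # position 1
pos_learning = tokens.index("learning")  # position 3
distance = abs(pos_learning - pos_machine)  # 2

# A phrase query requires distance == 1 (adjacent, in order);
# a proximity query with slop 2 accepts distance <= 2.
print(distance == 1)   # False: no phrase match
print(distance <= 2)   # True: slop-2 proximity match
```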
Positional Indexes
A standard inverted index maps each word to a list of document IDs. A positional index extends this: each word maps to a list of (document ID, positions in document) entries. For "machine" appearing at positions 5 and 47 in document 123, the index stores (123, [5, 47]).
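An in-memory version of this structure is straightforward to sketch. The function name and the dict-of-dicts layout (word -> {doc_id: [positions]}) are illustrative choices, not a standard API:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build word -> {doc_id: [positions]} from {doc_id: text}.

    A minimal sketch: tokenization is whitespace split, positions
    are 0-based word offsets within each document.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical corpus: "machine" occurs at positions 0 and 3 in doc 1.
index = build_positional_index({1: "machine learning and machine vision"})
print(index["machine"][1])  # [0, 3]
```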
Storage cost is significant. Positions add roughly 4 bytes per word occurrence, so a 1000-word document contributes about 4 KB of position data. For a billion-document corpus, a positional index can be 10x larger than a non-positional one. The trade is storage for proximity query capability.
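The overhead estimate above is simple arithmetic; a back-of-the-envelope helper makes the assumption explicit. The fixed 4-bytes-per-position figure is the uncompressed worst case from the text (real indexes typically delta-encode and compress positions well below that):

```python
def positional_overhead_bytes(word_occurrences, bytes_per_position=4):
    """Rough uncompressed position-data size, assuming a fixed-width
    4-byte position per word occurrence (no compression)."""
    return word_occurrences * bytes_per_position

print(positional_overhead_bytes(1000))  # 4000 bytes, i.e. ~4 KB per 1000-word document
```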
Slop Evaluation Algorithm
For the query "A NEAR B" with slop S: find documents containing both A and B. For each such document, fetch the position lists for A and B and check whether any pair of positions is within distance S.
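The per-document check can be written directly as a nested scan. This is a sketch of the naive pairwise approach, before the optimization below; the function name is illustrative:

```python
def near_match_naive(positions_a, positions_b, slop):
    """True if any pair of positions is within slop.

    Compares every pair: O(n*m) for lists of size n and m.
    """
    return any(abs(a - b) <= slop
               for a in positions_a
               for b in positions_b)

print(near_match_naive([5, 47], [7, 90], 2))  # True: |5 - 7| <= 2
print(near_match_naive([5, 47], [20], 2))     # False: no pair within 2
```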
The naive approach is O(n*m) for position lists of size n and m. Optimization: keep the position lists sorted and advance two pointers through them. If |posA - posB| is at most S, a match is found; otherwise advance the pointer at the smaller position. This runs in O(n + m).
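The two-pointer scan can be sketched as follows, assuming both position lists arrive sorted (which a positional index naturally provides, since positions are appended in document order):

```python
def near_match(positions_a, positions_b, slop):
    """True if any pair of positions is within slop.

    Two-pointer scan over sorted lists: O(n + m).
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= slop:
            return True
        # No match at this pair: advancing the smaller position is safe,
        # because it can only move closer to the other list's position.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

print(near_match([5, 47], [44, 90], 3))  # True: |47 - 44| <= 3
print(near_match([1, 2], [10], 2))       # False
```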