Text Proximity: Positional Indexes and Slop Evaluation
Text Proximity Defined
Text proximity search finds documents where specified words appear near each other. The phrase query "machine learning" requires those words to be adjacent and in order. The proximity query "machine NEAR learning" allows the words to appear within some maximum distance, called slop.
Why it matters: "the machine was learning" matches a proximity query with slop 2 but not a phrase query. Proximity search tolerates word-order variation and intervening words, giving a flexibility that phrase search cannot.
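The slop arithmetic for that example can be checked directly. This is a minimal sketch, assuming 0-based token positions and slop measured as the absolute difference between positions:

```python
# Tokenize the example sentence; positions are 0-based word offsets.
tokens = "the machine was learning".split()
pos_machine = tokens.index("machine")    # position 1
pos_learning = tokens.index("learning")  # position 3
distance = abs(pos_learning - pos_machine)  # 2

# A phrase query requires distance == 1 (adjacent, in order);
# a proximity query with slop 2 accepts distance <= 2.
print(distance == 1)   # False: no phrase match
print(distance <= 2)   # True: slop-2 proximity match
```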
Positional Indexes
A standard inverted index maps each word to a list of document IDs. A positional index extends this: each word maps to a list of (document ID, positions in document) entries. For "machine" appearing at positions 5 and 47 in document 123, the index stores (123, [5, 47]).
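An in-memory version of this structure is straightforward to sketch. The function name and the dict-of-dicts layout (word -> {doc_id: [positions]}) are illustrative choices, not a standard API:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build word -> {doc_id: [positions]} from {doc_id: text}.

    A minimal sketch: tokenization is whitespace split, positions
    are 0-based word offsets within each document.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical corpus: "machine" occurs at positions 0 and 3 in doc 1.
index = build_positional_index({1: "machine learning and machine vision"})
print(index["machine"][1])  # [0, 3]
```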
Storage cost is significant. Positions add roughly 4 bytes per word occurrence, so a 1000-word document contributes about 4 KB of position data. For a billion-document corpus, a positional index can be 10x larger than a non-positional one. The trade is storage for proximity query capability.
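The overhead estimate above is simple arithmetic; a back-of-the-envelope helper makes the assumption explicit. The fixed 4-bytes-per-position figure is the uncompressed worst case from the text (real indexes typically delta-encode and compress positions well below that):

```python
def positional_overhead_bytes(word_occurrences, bytes_per_position=4):
    """Rough uncompressed position-data size, assuming a fixed-width
    4-byte position per word occurrence (no compression)."""
    return word_occurrences * bytes_per_position

print(positional_overhead_bytes(1000))  # 4000 bytes, i.e. ~4 KB per 1000-word document
```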
Slop Evaluation Algorithm
For the query "A NEAR B" with slop S: find documents containing both A and B. For each such document, fetch the position lists for A and B and check whether any pair of positions is within distance S.
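The per-document check can be written directly as a nested scan. This is a sketch of the naive pairwise approach, before the optimization below; the function name is illustrative:

```python
def near_match_naive(positions_a, positions_b, slop):
    """True if any pair of positions is within slop.

    Compares every pair: O(n*m) for lists of size n and m.
    """
    return any(abs(a - b) <= slop
               for a in positions_a
               for b in positions_b)

print(near_match_naive([5, 47], [7, 90], 2))  # True: |5 - 7| <= 2
print(near_match_naive([5, 47], [20], 2))     # False: no pair within 2
```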
The naive approach is O(n*m) for position lists of size n and m. Optimization: keep the position lists sorted and advance two pointers through them. If |posA - posB| is at most S, a match is found; otherwise advance the pointer at the smaller position. This runs in O(n + m).
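The two-pointer scan can be sketched as follows, assuming both position lists arrive sorted (which a positional index naturally provides, since positions are appended in document order):

```python
def near_match(positions_a, positions_b, slop):
    """True if any pair of positions is within slop.

    Two-pointer scan over sorted lists: O(n + m).
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= slop:
            return True
        # No match at this pair: advancing the smaller position is safe,
        # because it can only move closer to the other list's position.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

print(near_match([5, 47], [44, 90], 3))  # True: |47 - 44| <= 3
print(near_match([1, 2], [10], 2))       # False
```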