
Text Proximity: Positional Indexes and Slop Evaluation

Text Proximity Defined

Text proximity search finds documents where specified words appear near each other. The phrase query "machine learning" requires the two words to be adjacent and in order. The proximity query "machine NEAR learning" allows the words to be within some maximum distance, called slop.

Why it matters: "the machine was learning" matches a proximity query with slop 2 but not a phrase query. Proximity search handles word-order variations and intervening words with a flexibility that phrase search cannot.

Positional Indexes

Standard inverted indexes map each word to a list of document IDs. Positional indexes extend this: each word maps to a list of (document ID, positions-in-document) pairs. For "machine" appearing at positions 5 and 47 in document 123, store (123, [5, 47]).
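A minimal sketch of building such a positional index, assuming a hypothetical in-memory corpus keyed by document ID and whitespace tokenization with 0-based positions:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

# Hypothetical one-document corpus
docs = {123: "deep machine models keep the machine learning"}
index = build_positional_index(docs)
print(index["machine"][123])  # [1, 5] -- "machine" occurs at positions 1 and 5
```

Because positions are appended in reading order, each list comes out sorted, which the two-pointer slop evaluation below relies on.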

Storage cost is significant: positions add roughly 4 bytes per word occurrence, so a 1,000-word document adds about 4 KB of position data. For a billion-document corpus, a positional index can be 10x larger than a non-positional one. You trade storage for proximity query capability.
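The arithmetic behind that estimate, as a back-of-envelope sketch (the 4-bytes-per-position and corpus-size figures are the assumptions stated above):

```python
BYTES_PER_POSITION = 4          # assumed fixed-width position encoding
WORDS_PER_DOC = 1000
NUM_DOCS = 1_000_000_000

per_doc = WORDS_PER_DOC * BYTES_PER_POSITION   # 4000 bytes ~= 4 KB per document
total = per_doc * NUM_DOCS                     # position data across the corpus
print(f"{per_doc} bytes/doc, {total / 1e12:.0f} TB total")  # 4000 bytes/doc, 4 TB total
```

Real systems compress positions with delta encoding and variable-length integers, so 4 bytes per occurrence is a pessimistic upper bound.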

Slop Evaluation Algorithm

For the query "A NEAR B" with slop S: find documents containing both A and B. For each such document, fetch the position lists for A and B and check whether any pair of positions is at most S apart.

The naive approach is O(n*m) for position lists of sizes n and m. Optimization: since position lists are stored sorted, advance two pointers through them. If |posA - posB| is at most S, a match is found; otherwise advance the pointer at the smaller position. This is O(n + m).
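The two-pointer scan described above can be sketched as follows, assuming sorted position lists as produced by a positional index:

```python
def within_slop(pos_a, pos_b, slop):
    """Return True if any pair of positions from the two sorted
    lists is at most `slop` apart. O(n + m) two-pointer scan."""
    i = j = 0
    while i < len(pos_a) and j < len(pos_b):
        if abs(pos_a[i] - pos_b[j]) <= slop:
            return True
        # Advance the pointer at the smaller position: moving the larger
        # one could only increase the gap.
        if pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return False

# "machine" at positions 5 and 47, "learning" at 7: distance 2
print(within_slop([5, 47], [7], slop=2))  # True
print(within_slop([5, 47], [7], slop=1))  # False
```

This checks a single document; the full "A NEAR B" evaluation runs it per document in the intersection of A's and B's posting lists.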

✅ Best Practice: Use positional indexes when phrase and proximity queries are important. The storage overhead is significant but enables queries that standard inverted indexes cannot support. Consider selective positional indexing for only high-value fields.
💡 Key Takeaways
Text proximity finds documents where words appear within specified distance (slop)
Positional indexes store word positions, not just document IDs
Storage cost: 4 bytes per word occurrence, can be 10x larger than non-positional
Slop evaluation: two-pointer algorithm on sorted position lists is O(n + m)
Selective positional indexing reduces storage while enabling key proximity queries
📌 Interview Tips
1. Explain phrase vs proximity: phrase requires exact adjacency, proximity allows intervening words within slop
2. Calculate storage overhead: a 1000-word document adds 4 KB with positions vs negligible without
3. Describe the two-pointer algorithm: advance the smaller position until the distance is within slop or the lists are exhausted