Text Proximity: Positional Indexes and Slop Evaluation
Text proximity search finds documents where query terms appear within a small window of token positions, enabling phrase-like matching without requiring exact adjacency. This powers features like quoted phrase search and relevance boosting based on term distance. The implementation relies on positional inverted indexes, where each posting stores not just the document ID but also the token positions where the term appears.
A standard inverted index maps each term to a list of document IDs. A positional index extends this: each term maps to a list of (document ID, [position1, position2, ...]) tuples. For the phrase "machine learning" with slop 2, the engine retrieves postings for both terms, then for each document containing both, it scans the sorted position lists with a two-pointer or sliding-window algorithm to find whether any occurrence of "machine" is followed by "learning" within 2 token positions. This allows matches like "machine advanced learning" (1 token gap) while rejecting "machine vision and deep learning" (3 token gap).
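A minimal Python sketch of this position-list scan follows. The names (PositionalIndex, phrase_match) are illustrative rather than taken from any particular engine; the two-pointer loop simply checks whether some occurrence of the second term follows the first within the slop.

# Minimal sketch of slop-based proximity matching over a positional index.
# Names (PositionalIndex, phrase_match) are illustrative, not from a specific engine.
from typing import Dict, List

# term -> {doc_id: sorted list of token positions}
PositionalIndex = Dict[str, Dict[int, List[int]]]

def phrase_match(index: PositionalIndex, first: str, second: str, slop: int) -> List[int]:
    """Return doc IDs where `second` follows `first` with at most `slop` intervening tokens."""
    matches = []
    docs_first = index.get(first, {})
    docs_second = index.get(second, {})
    for doc_id in sorted(docs_first.keys() & docs_second.keys()):
        pos_a, pos_b = docs_first[doc_id], docs_second[doc_id]
        i = j = 0
        # Two-pointer scan over the sorted position lists.
        while i < len(pos_a) and j < len(pos_b):
            gap = pos_b[j] - pos_a[i] - 1  # intervening tokens between the two occurrences
            if 0 <= gap <= slop:
                matches.append(doc_id)
                break
            if pos_b[j] <= pos_a[i]:
                j += 1  # second term occurs before the first; try its next occurrence
            else:
                i += 1  # gap too large; a later occurrence of the first term may be closer
    return matches

# Doc 1: "machine advanced learning"; doc 2: "machine vision and deep learning".
index: PositionalIndex = {
    "machine":  {1: [0], 2: [0]},
    "learning": {1: [2], 2: [4]},
}
print(phrase_match(index, "machine", "learning", slop=2))  # [1]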
The space cost is significant. Positional indexes are typically 20% to 40% larger than non-positional indexes because of the stored integer position arrays. Delta encoding compresses positions: instead of storing absolute positions [5, 12, 47], store deltas [5, 7, 35], which compress better with variable-length integer encoding. Production engines such as Lucene, and Elasticsearch built on top of it, rely on this heavily.
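A small sketch of delta encoding combined with a variable-length (varint) byte encoding, using hypothetical helper names; production postings codecs are more elaborate, but the idea is the same.

# Sketch of delta encoding a sorted position list, then packing deltas as varints
# (7 payload bits per byte, high bit set on all but the last byte).
from typing import List

def delta_encode(positions: List[int]) -> List[int]:
    """[5, 12, 47] -> [5, 7, 35]: small gaps compress better than absolute positions."""
    deltas, prev = [], 0
    for p in positions:
        deltas.append(p - prev)
        prev = p
    return deltas

def varint(n: int) -> bytes:
    """Variable-length integer encoding: one byte suffices for values below 128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

deltas = delta_encode([5, 12, 47])
encoded = b"".join(varint(d) for d in deltas)
print(deltas, len(encoded))  # [5, 7, 35] 3 -- three bytes instead of three full-width integers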
Proximity boosts are integrated into ranking functions. Exact phrase matches (slop 0) receive the maximum boost, and larger slops receive progressively less weight. A common scoring feature is the inverse of the minimal span length containing all query terms: documents where terms cluster tightly score higher than those where terms are scattered. This balances precision and recall: tight slop narrows results but risks missing paraphrases, while large slop increases recall but admits weaker matches and incurs higher CPU cost because more position windows must be evaluated.
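One way such a feature can be computed is the classic minimal-window scan over the query terms' position lists. The function name and the simple 1/span weighting below are assumptions for illustration, not any engine's exact scoring formula.

# Illustrative proximity feature: inverse of the minimal token span covering one
# occurrence of every query term.
import heapq
from typing import List

def min_span_score(position_lists: List[List[int]]) -> float:
    """position_lists holds one sorted position list per query term."""
    # Start a window with the first occurrence of each term; repeatedly advance
    # whichever term currently has the smallest position.
    heads = [(plist[0], idx, 0) for idx, plist in enumerate(position_lists)]
    heapq.heapify(heads)
    current_max = max(pos for pos, _, _ in heads)
    best = float("inf")
    while True:
        low, idx, ptr = heapq.heappop(heads)
        best = min(best, current_max - low + 1)  # span length of the current window
        if ptr + 1 == len(position_lists[idx]):
            break  # this term has no later occurrence; the window cannot improve
        nxt = position_lists[idx][ptr + 1]
        current_max = max(current_max, nxt)
        heapq.heappush(heads, (nxt, idx, ptr + 1))
    return 1.0 / best

# "machine" at position 0, "learning" at position 2: minimal span is 3 tokens.
print(min_span_score([[0], [2]]))  # 0.333...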
💡 Key Takeaways
•Positional inverted indexes store token positions for each term occurrence, enabling phrase and proximity matching by scanning sorted position lists with sliding window algorithms
•Index size increases 20% to 40% versus non-positional indexes; delta encoding compresses position arrays by storing gaps instead of absolute positions
•Slop parameter controls maximum positional displacement: slop 0 requires exact adjacency for phrases, slop 2 allows up to 2 intervening tokens
•Proximity boosts integrate into ranking by scoring the inverse of minimal span length; tighter clustering of query terms produces higher relevance scores
•Large slop values (for example, slop greater than 10) can cause quadratic worst-case behavior in documents with high term frequency; guard with per-document match windows and early termination
📌 Examples
Elasticsearch phrase query with slop: {"match_phrase": {"content": {"query": "machine learning", "slop": 2}}} matches "machine advanced learning" but not "machine vision and deep learning"
E-commerce search engines boost exact product-name phrases with slop 0 while allowing slop 3 or 5 for broader category matches, balancing precision for known products against recall for exploratory queries