Entity Parsing and Linking in Query Understanding
Entity parsing identifies structured attributes within raw query text, while entity linking maps surface forms to canonical identifiers in a knowledge base. When a user types "macbook air 13 inch silver," the parser must tag the product type ("MacBook Air"), the size ("13 inch"), and the color ("silver"), then link "macbook air" to Apple's canonical product line identifier. This structured output enables precise filtering and faceted search.
Production parsers use sequence tagging models such as Conditional Random Fields (CRF) or Bidirectional Long Short-Term Memory (BiLSTM) networks with 3 to 8 milliseconds latency at p50. The models predict Beginning, Inside, Outside (BIO) tags for each token, identifying spans for brand, category, attributes, and modifiers. Feature engineering includes word embeddings, part-of-speech tags, capitalization patterns, surrounding context windows of 2 to 5 tokens, and gazetteer matches against dictionaries of known brands and attribute values. Amazon maintains brand dictionaries with hundreds of thousands of entries and uses fuzzy matching with edit distance thresholds conditioned on string length to handle misspellings, mapping "addidas" to "Adidas".
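To make the gazetteer step concrete, here is a minimal sketch of length-conditioned fuzzy matching. The brand list, the edit budgets in `max_edits`, and the `match_brand` helper are illustrative assumptions, not Amazon's actual implementation:

```python
# Sketch: gazetteer lookup with an edit-distance budget that grows with
# token length, so short tokens must match exactly. All values hypothetical.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

BRAND_GAZETTEER = ["Adidas", "Apple", "Samsung", "Sony"]  # hypothetical entries

def max_edits(token: str) -> int:
    """Length-conditioned threshold: longer strings tolerate more edits."""
    if len(token) <= 4:
        return 0
    return 1 if len(token) <= 9 else 2

def match_brand(token: str) -> str | None:
    budget = max_edits(token)
    best, best_dist = None, budget + 1
    for brand in BRAND_GAZETTEER:
        d = levenshtein(token.lower(), brand.lower())
        if d < best_dist:
            best, best_dist = brand, d
    return best  # None unless some brand landed within the edit budget

print(match_brand("addidas"))  # -> "Adidas" (distance 1, within budget for 7 chars)
print(match_brand("appl"))     # -> None (4 chars requires an exact match)
```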
Entity linking is where precision becomes critical. Ambiguous mentions like "apple" could refer to the fruit or the technology company. Linking systems use product-type agreement, requiring the detected category to match the entity's domain before linking. They also maintain confidence thresholds of 0.7 to 0.85 and include an abstain path that keeps the original token unlinked when confidence is low. Meta's Marketplace search links location mentions to canonical place identifiers using geospatial context and user location history. Failures in linking cause downstream problems: incorrectly mapping "ga" to "GAP" instead of "Georgia" over-constrains results and leads to zero-result pages.
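The product-type agreement check and the abstain path can be sketched as a simple gating step. The `Candidate` shape, the 0.75 threshold, and the candidate scores below are hypothetical; in production the confidence would come from a trained linking model:

```python
# Sketch: link a mention only when (1) the entity's domain agrees with the
# detected product category and (2) model confidence clears a threshold;
# otherwise abstain and keep the raw token. Values are illustrative.

from dataclasses import dataclass

@dataclass
class Candidate:
    entity_id: str
    domain: str        # e.g. "electronics", "grocery", "location"
    confidence: float  # assumed output of an upstream linking model

LINK_THRESHOLD = 0.75  # illustrative; the text cites 0.7 to 0.85

def link(mention: str, detected_category: str,
         candidates: list[Candidate]) -> str | None:
    """Return a canonical entity id, or None to abstain."""
    eligible = [c for c in candidates if c.domain == detected_category]
    if not eligible:
        return None  # no domain agreement: abstaining beats over-constraining
    best = max(eligible, key=lambda c: c.confidence)
    return best.entity_id if best.confidence >= LINK_THRESHOLD else None

candidates = [
    Candidate("brand:apple_inc", "electronics", 0.92),
    Candidate("produce:apple", "grocery", 0.64),
]
print(link("apple", "electronics", candidates))  # -> "brand:apple_inc"
print(link("apple", "grocery", candidates))      # -> None (0.64 < 0.75)
```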
💡 Key Takeaways
• Sequence tagging models like Conditional Random Fields (CRF) or Bidirectional Long Short-Term Memory (BiLSTM) networks predict Beginning, Inside, Outside (BIO) tags with 3 to 8 milliseconds latency at p50, using word embeddings and context windows of 2 to 5 tokens.
• Amazon maintains brand dictionaries with hundreds of thousands of entries and uses edit distance thresholds conditioned on string length to handle misspellings, mapping "addidas" to "Adidas" with Levenshtein distance under 2 for strings longer than 5 characters.
• Entity linking requires product-type agreement and confidence thresholds of 0.7 to 0.85. When confidence is low, systems abstain and keep the original token unlinked rather than risk an incorrect mapping.
• Ambiguous entities like "apple" are disambiguated using the detected category. If the query includes "laptop" or "phone," link to Apple Inc.; if it includes "fruit" or "pie," link to the fruit or abstain.
• Linking failures cause expensive downstream problems. Incorrectly mapping "ga" to the GAP brand instead of the Georgia location over-constrains filters and produces zero results, increasing abandonment by 15 to 25 percent in e-commerce contexts.
📌 Examples
Google Search: query "apple airpods pro" extracts brand Apple and product AirPods Pro, links to knowledge graph entity for Apple Inc product line, enables structured result cards with pricing and availability.
Amazon: query "samsung 55 inch 4k tv under 500" parses brand Samsung, size 55 inch, resolution 4K, category TV, price_max 500, links Samsung to canonical brand ID, applies filters to catalog retrieval in 11 milliseconds.
Airbnb: query "cabin lake tahoe 2 bedrooms" extracts property_type cabin, location Lake Tahoe, bedrooms 2, links Lake Tahoe to canonical place identifier with geospatial coordinates, routes to lodging index with location radius filter.
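To show what the parsed output might look like end to end, here is one plausible structure for the Samsung query above. The field names, canonical IDs, and filter translation are assumptions for illustration, not any engine's actual schema:

```python
# Sketch: a structured parse for "samsung 55 inch 4k tv under 500" and its
# translation into retrieval filters. Schema and IDs are hypothetical.

from dataclasses import dataclass, field

@dataclass
class ParsedQuery:
    raw: str
    brand: str | None = None
    category: str | None = None
    attributes: dict[str, str] = field(default_factory=dict)
    price_max: float | None = None

parsed = ParsedQuery(
    raw="samsung 55 inch 4k tv under 500",
    brand="brand:samsung",  # canonical ID produced by the linking step
    category="tv",
    attributes={"size": "55 inch", "resolution": "4K"},
    price_max=500.0,
)

# Hand the structured fields to retrieval as filters (syntax varies by engine).
filters = {"brand_id": parsed.brand, "category": parsed.category,
           "price": {"lte": parsed.price_max}, **parsed.attributes}
print(filters)
```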