Search Indexing and Serving at Scale

From Metadata to Search:

Once metadata is collected and enriched, it must be indexed for fast, flexible search. The challenge is supporting full-text search over names, descriptions, and column names while also enabling faceted filtering by owner, domain, type, freshness, classification, and quality score. Users expect Google-level responsiveness: results in under 200 milliseconds even when the catalog has millions of entities.

Building the Search Index:

A dedicated search index is built from the metadata store, separate from the transactional store used for updates. This index supports multiple query patterns that would be slow in a relational database alone.

Full-text search covers entity names, descriptions, column names, documentation, and tags. For example, searching "daily active users" matches tables named users_daily_active and dau_android (the latter only if the index expands the abbreviation DAU), as well as any table with "daily active users" in its description.
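A minimal sketch of how such matching could work, assuming a toy in-memory inverted index rather than a production search engine. The SYNONYMS table, the entity records, and the function names are all hypothetical and only illustrate tokenizing snake_case names and expanding abbreviations at index time.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; a real catalog would curate or learn this.
SYNONYMS = {"dau": ["daily", "active", "users"]}

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def expand(tokens):
    """Add the expansion of any known abbreviation to the token set."""
    out = set(tokens)
    for t in tokens:
        out.update(SYNONYMS.get(t, []))
    return out

def build_index(entities):
    """Map each token to the set of entity names whose text contains it."""
    index = defaultdict(set)
    for e in entities:
        text = " ".join([e["name"], e.get("description", "")])
        for token in expand(tokenize(text)):
            index[token].add(e["name"])
    return index

def search(index, query):
    """Return entities that contain every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(t, set()) for t in tokens]
    return set.intersection(*postings)

# Illustrative records, not real catalog data.
entities = [
    {"name": "users_daily_active", "description": "One row per active user per day"},
    {"name": "dau_android", "description": "Android engagement rollup"},
    {"name": "orders_raw", "description": "Feeds the daily active users dashboard"},
    {"name": "payments_ledger", "description": "Settled payment transactions"},
]
index = build_index(entities)
results = search(index, "daily active users")
```

Here results contains users_daily_active (matched on its name), dau_android (matched through the abbreviation expansion), and orders_raw (matched on its description). A production index would add stemming, phrase matching, and relevance scoring on top of this.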

Faceted filters let users narrow results by selecting multiple criteria at once, for example: "Show me datasets owned by the payments team, refreshed in the last hour, with high quality scores, containing PII." This requires inverted indexes on each filterable attribute.
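The per-attribute inverted indexes can be sketched as one value-to-ids map per facet, with a query answered by intersecting the selected posting sets. This is an illustration under assumptions: the facet names, the bucketed freshness values such as "last_hour", and the sample entities are hypothetical.

```python
from collections import defaultdict

def build_facet_index(entities, facets):
    """For each facet, map each value to the set of entity ids that carry it."""
    index = {f: defaultdict(set) for f in facets}
    for e in entities:
        for f in facets:
            index[f][e[f]].add(e["id"])
    return index

def filter_by_facets(index, all_ids, selected):
    """Intersect the posting sets for every selected facet value."""
    result = set(all_ids)
    for facet, value in selected.items():
        result &= index[facet].get(value, set())
    return result

# Illustrative records; freshness is pre-bucketed so it can be an exact-match facet.
entities = [
    {"id": "payments.txn", "owner": "payments", "freshness": "last_hour", "quality": "high", "pii": True},
    {"id": "payments.refunds", "owner": "payments", "freshness": "last_day", "quality": "high", "pii": True},
    {"id": "growth.signups", "owner": "growth", "freshness": "last_hour", "quality": "high", "pii": True},
    {"id": "payments.fx_rates", "owner": "payments", "freshness": "last_hour", "quality": "high", "pii": False},
]
facets = ["owner", "freshness", "quality", "pii"]
facet_index = build_facet_index(entities, facets)
all_ids = {e["id"] for e in entities}

matches = filter_by_facets(
    facet_index,
    all_ids,
    {"owner": "payments", "freshness": "last_hour", "quality": "high", "pii": True},
)
```

Only payments.txn satisfies all four criteria. Because each facet lookup is a set operation, adding another filter narrows the result without scanning every entity.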

Ranking and Relevance:

Poor ranking is a silent failure mode. If the correct dataset is buried on page three, users either pick the wrong table or give up. Custom ranking combines several signals: text relevance score, usage popularity from query logs, recency of last update, and endorsement signals like documentation completeness or number of downstream consumers.
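One way to combine these signals is a weighted sum of normalized scores. The sketch below is an assumption, not a prescribed formula: the weights, the field names (text_relevance, queries_30d, days_since_update, doc_completeness), and the log scaling of popularity are all illustrative choices.

```python
import math

# Illustrative weights (assumption); real systems tune these against click data.
WEIGHTS = {"text": 0.5, "popularity": 0.3, "recency": 0.1, "endorsement": 0.1}

def rank_score(entity, max_queries):
    """Blend the four signals into a single score in the range 0 to 1."""
    # Log scaling keeps a handful of very hot tables from dominating.
    if max_queries > 0:
        popularity = math.log1p(entity["queries_30d"]) / math.log1p(max_queries)
    else:
        popularity = 0.0
    # Recency decays from 1.0 (updated today) toward 0 as the table goes stale.
    recency = 1.0 / (1.0 + entity["days_since_update"])
    return (
        WEIGHTS["text"] * entity["text_relevance"]
        + WEIGHTS["popularity"] * popularity
        + WEIGHTS["recency"] * recency
        + WEIGHTS["endorsement"] * entity["doc_completeness"]
    )

def rank(entities):
    """Sort candidates from highest to lowest blended score."""
    max_queries = max((e["queries_30d"] for e in entities), default=0)
    return sorted(entities, key=lambda e: rank_score(e, max_queries), reverse=True)

# Two candidates with identical text relevance; usage and recency break the tie.
candidates = [
    {"name": "users_daily_historical", "text_relevance": 0.8,
     "queries_30d": 50, "days_since_update": 400, "doc_completeness": 0.9},
    {"name": "users_daily_active", "text_relevance": 0.8,
     "queries_30d": 5000, "days_since_update": 0, "doc_completeness": 0.9},
]
ranked = rank(candidates)
```

With equal text relevance, users_daily_active ranks first because it has 100x the query volume and was updated more recently.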

Search Performance Targets:

P95 search latency: 200ms
Lineage graph: 500ms
QPS capacity: 100s

Handling Ambiguity:

Search must handle synonyms and abbreviations that vary across teams. "DAU" might mean daily active users in one team and daily active uploads in another. Historical datasets coexist with current recommended ones, and similarly named tables serve different purposes. The system needs to understand context from the user's team, recent queries, and dataset usage patterns.
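As a rough illustration of using team context, a query rewriter could consult a per-team glossary before falling back to an organization-wide default. The glossaries, team names, and the existence of a default meaning are assumptions made for this sketch; recent queries and usage patterns are left out.

```python
# Hypothetical glossaries; in practice these might come from catalog annotations.
DEFAULT_GLOSSARY = {"dau": "daily active users"}
TEAM_GLOSSARY = {
    "growth": {"dau": "daily active users"},
    "media": {"dau": "daily active uploads"},
}

def expand_query(query, team):
    """Replace known abbreviations using the team's glossary, then the default."""
    glossary = {**DEFAULT_GLOSSARY, **TEAM_GLOSSARY.get(team, {})}
    return " ".join(glossary.get(word.lower(), word) for word in query.split())

media_query = expand_query("DAU by country", "media")
growth_query = expand_query("DAU by country", "growth")
unknown_team_query = expand_query("DAU by country", "finance")
```

The same query text expands to "daily active uploads by country" for the media team and "daily active users by country" for the growth team, so the downstream index search sees the meaning the user most likely intended.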

Caching is used heavily for popular entities, autocomplete suggestions, and common filter combinations. In large organizations, discovery systems serve hundreds of queries per second, and caching reduces load on the backend metadata store.
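A small time-to-live cache keyed on the normalized filter combination shows the idea. This is a sketch under assumptions: TTLCache, fetch_from_metadata_store, and the 60-second TTL are hypothetical, and a real deployment would more likely use a shared cache than a per-process dictionary.

```python
import time

class TTLCache:
    """Cache values for a fixed number of seconds, then recompute on next access."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the backend
        value = compute()
        self._store[key] = (now, value)
        return value

backend_calls = []

def fetch_from_metadata_store(filters):
    """Stand-in for an expensive backend query (hypothetical)."""
    backend_calls.append(dict(filters))
    return ["payments.txn"]

def cached_search(cache, filters):
    # Sort the filter items so the same combination always yields the same key.
    key = tuple(sorted(filters.items()))
    return cache.get_or_compute(key, lambda: fetch_from_metadata_store(filters))

cache = TTLCache(ttl_seconds=60)
first = cached_search(cache, {"owner": "payments", "pii": True})
second = cached_search(cache, {"pii": True, "owner": "payments"})
```

The second call uses the same filters in a different order, so it is served from the cache and the backend is queried only once. The TTL bounds how stale a cached result can be, which matters when freshness is itself a filter.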

Integration Beyond Search:

Discovery is not just a web UI. It exposes APIs that notebook environments, BI tools, and ML platforms call to let users pick datasets without leaving their tools. An ML platform might call the catalog to list all "customer features" with freshness under 15 minutes, then mount them directly with end-to-end lineage automatically recorded.
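The ML platform flow can be sketched with an in-memory stand-in for the catalog API. Everything here is hypothetical: mount_features, the freshness_minutes and tags fields, the sample features, and the representation of lineage as (feature, consumer) pairs are assumptions for illustration, not an actual catalog interface.

```python
def mount_features(catalog, lineage, consumer, tag, max_freshness_minutes):
    """Select fresh features carrying a tag and record who consumes them."""
    selected = [
        f["name"]
        for f in catalog
        if tag in f["tags"] and f["freshness_minutes"] < max_freshness_minutes
    ]
    # Record one lineage edge per mounted feature so the dependency is traceable.
    for name in selected:
        lineage.append((name, consumer))
    return selected

# Illustrative feature records, not real catalog data.
catalog = [
    {"name": "customer_ltv", "tags": {"customer"}, "freshness_minutes": 5},
    {"name": "customer_churn_score", "tags": {"customer"}, "freshness_minutes": 60},
    {"name": "merchant_risk", "tags": {"merchant"}, "freshness_minutes": 2},
]
lineage = []
mounted = mount_features(catalog, lineage, "churn_model_v2", "customer", 15)
```

Only customer_ltv is both tagged "customer" and fresher than 15 minutes, and the lineage list gains an edge from that feature to the consuming model without the user recording it by hand.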

💡 Key Takeaways

✓ Target search latency is 100 to 200 milliseconds p95 for interactive use, with capacity for hundreds of queries per second in large organizations

✓ Custom ranking combines text relevance, usage popularity from query logs, recency, and endorsement signals like documentation completeness

✓ Poor ranking is a silent failure: if the right dataset is not in the top few results, users pick the wrong table or give up entirely

✓ APIs enable programmatic integration so notebooks, BI tools, and ML platforms can search the catalog without users leaving their workflows

📌 Interview Tips

1. Searching "daily active users" returns users_daily_active ranked above users_daily_historical because it has 100x more queries in the last month

2. An ML platform calls the catalog API to list all features tagged "customer" with freshness_minutes less than 15, returning 47 features in 120ms
