What is Data Discovery & Search?

Definition
Data Discovery & Search is an enterprise system that helps engineers and analysts find the right dataset among thousands of options, understand if it is safe to use, and know how to query it. Think of it as Google for your company's internal data.
The Core Problem: In a large company, you might have 5,000 to 50,000 datasets spread across data lakes, warehouses, Kafka topics, and BI tools. Without a discovery system, engineers spend hours or even days asking in Slack, searching wikis, or digging through code to find the correct table. One engineer might use the users_daily table while another uses daily_active_users, creating inconsistent metrics across the organization.
How It Works: A discovery system builds a searchable catalog of metadata. It does not scan raw data on every query. Instead, it continuously ingests three types of metadata:
First, technical metadata includes schemas, data types, storage locations, and query history.
Second, business metadata covers descriptions, owners, domains, and business terms.
Third, operational metadata tracks freshness, data quality scores, incident history, and usage statistics.
This metadata is normalized into a graph structure representing datasets, columns, owners, lineage between tables, and usage patterns. When you search for "daily active users for Android", the system queries this metadata graph and returns ranked results in under 200 milliseconds, enriched with ownership, quality scores, and sample queries.
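To make the catalog idea concrete, here is a minimal in-memory sketch of a metadata record and a ranked search over it. It is an illustration under stated assumptions, not how any particular discovery product works: the DatasetEntry and Catalog names, the field names, the storage locations, the owner, and the scoring formula (text match weighted by quality and usage) are all hypothetical.

```python
import math
import re
from dataclasses import dataclass, field


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


@dataclass
class DatasetEntry:
    # Technical metadata
    name: str
    schema: dict                      # column name -> data type
    location: str
    # Business metadata
    description: str = ""
    owner: str = ""
    domain: str = ""
    # Operational metadata
    minutes_since_refresh: int = 0
    quality_score: float = 0.0        # assumed scale: 0.0 to 1.0
    queries_last_30d: int = 0
    # Lineage edges to upstream datasets
    upstream: list = field(default_factory=list)


class Catalog:
    def __init__(self):
        self.entries = {}             # dataset name -> DatasetEntry
        self.index = {}               # token -> set of dataset names

    def add(self, entry):
        """Index metadata only; the raw data is never read."""
        self.entries[entry.name] = entry
        text = " ".join([entry.name, entry.description, entry.domain, *entry.schema])
        for token in tokenize(text):
            self.index.setdefault(token, set()).add(entry.name)

    def search(self, query, limit=5):
        tokens = set(tokenize(query))
        if not tokens:
            return []
        hits = {}                     # dataset name -> number of matched query tokens
        for token in tokens:
            for name in self.index.get(token, ()):
                hits[name] = hits.get(name, 0) + 1

        def score(name):
            # Hypothetical ranking: text match, boosted by quality and usage.
            entry = self.entries[name]
            text_match = hits[name] / len(tokens)
            trust = 0.5 + 0.5 * entry.quality_score
            popularity = 1 + math.log1p(entry.queries_last_30d)
            return text_match * trust * popularity

        ranked = sorted(hits, key=score, reverse=True)
        return [self.entries[name] for name in ranked[:limit]]


catalog = Catalog()
catalog.add(DatasetEntry(
    name="daily_active_users",
    schema={"event_date": "date", "platform": "string", "dau": "bigint"},
    location="warehouse.analytics.daily_active_users",   # made-up location
    description="Daily active users by platform (Android, iOS, web)",
    owner="growth-analytics", domain="engagement",
    minutes_since_refresh=45, quality_score=0.95, queries_last_30d=1200,
    upstream=["events_raw"],
))
catalog.add(DatasetEntry(
    name="users_daily",
    schema={"user_id": "bigint", "snapshot_date": "date"},
    location="lake.raw.users_daily",                     # made-up location
    description="Raw daily snapshot of user accounts",
    quality_score=0.6, queries_last_30d=40,
))

results = catalog.search("daily active users for Android")
for entry in results:
    print(entry.name, entry.owner, entry.quality_score)
```

In this sketch daily_active_users ranks ahead of users_daily because it matches more query terms and carries a higher quality score and more usage. A production system would typically back the index with a dedicated search engine and a graph store rather than Python dictionaries.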
✓ In Practice: This is not just a search box. It is search plus trust plus context. You need to know not just where the data is, but also whether it is fresh and accurate, who owns it, and whether you have permission to use it.
💡 Key Takeaways
Discovery is an enterprise search engine for data assets, not a tool that scans raw data on every query
A metadata graph captures technical, business, and operational information about datasets, columns, and relationships
Target search latency is under 200 milliseconds at the 95th percentile (p95), even with millions of entities in the catalog
Without discovery, engineers waste hours per week finding data, leading to inconsistent metrics and slower product iteration
📌 Examples
1. An engineer searches for "daily active users for Android" and gets back not just table names, but also ownership information, last refresh time, data quality scores, and sample queries.
2. A data scientist filters for all customer features with freshness under 15 minutes to mount them in an ML platform, with lineage automatically recorded.
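The second example can be sketched as a filter over operational metadata followed by a lineage write. This is only an assumption-laden illustration: FeatureTable, select_fresh, mount, the table names, and the consumer name churn_model_training are hypothetical, and a real ML platform would record lineage through its own interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FeatureTable:
    name: str
    domain: str
    last_refreshed: datetime


def select_fresh(tables, domain, max_age, now):
    """Return tables in `domain` refreshed less than `max_age` ago."""
    return [t for t in tables
            if t.domain == domain and now - t.last_refreshed < max_age]


def mount(tables, consumer, lineage):
    """Record one lineage edge (source table, consumer) per mounted table."""
    for table in tables:
        lineage.append((table.name, consumer))


# Fixed clock so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
tables = [
    FeatureTable("customer_ltv_features", "customer", now - timedelta(minutes=5)),
    FeatureTable("customer_churn_features", "customer", now - timedelta(minutes=40)),
    FeatureTable("order_velocity_features", "orders", now - timedelta(minutes=2)),
]

lineage = []
fresh = select_fresh(tables, "customer", timedelta(minutes=15), now)
mount(fresh, "churn_model_training", lineage)
print(lineage)  # [('customer_ltv_features', 'churn_model_training')]
```

Only the customer table refreshed 5 minutes ago passes the filter; the 40-minute-old table is excluded, and the fresh table from the orders domain is excluded by the domain condition.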