
Metadata Ingestion and the Universal Catalog

The Ingestion Challenge: At scale, new tables, views, and Kafka topics appear constantly. A large company might add thousands of datasets per month across multiple warehouses, data lakes, and streaming platforms. The discovery system must continuously capture metadata from all these sources without becoming a bottleneck or missing critical updates.

Two Ingestion Patterns: Discovery systems use a hybrid approach that combines batch crawlers with event-driven ingestion. Batch crawlers periodically scan storage systems, warehouses, streaming platforms, and BI tools to detect new or changed entities. Critical systems are crawled every few minutes, while long-tail systems are scanned daily. This catches changes even when source systems do not emit events. Event-driven ingestion listens to schema registries, job orchestration events, and catalog update events. When a pipeline creates a new table or a schema changes, the discovery system receives an immediate notification. This reduces latency from minutes or hours to seconds and avoids expensive full rescans.
Metadata Update Latency: batch only, ~30 minutes; hybrid, ~2 minutes.
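To make the hybrid pattern concrete, here is a minimal Python sketch that combines a periodic batch crawler with an event-driven handler writing into the same catalog. The `Catalog` and `DatasetMetadata` classes, the connector callable, and the event shape are illustrative placeholders, not any specific platform's API.

```python
# Minimal sketch of hybrid metadata ingestion: a periodic batch crawler for
# completeness plus an event handler for low-latency updates. The catalog
# client and source connector are hypothetical placeholders.
import time
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DatasetMetadata:
    urn: str                      # e.g. "warehouse.sales.orders"
    schema: Dict[str, str]        # column name -> type
    last_seen: float = field(default_factory=time.time)

class Catalog:
    """In-memory stand-in for the metadata store."""
    def __init__(self):
        self.datasets: Dict[str, DatasetMetadata] = {}

    def upsert(self, meta: DatasetMetadata) -> None:
        self.datasets[meta.urn] = meta

def batch_crawl(catalog: Catalog, list_tables) -> None:
    """Periodic full scan: catches anything the event stream missed."""
    for urn, schema in list_tables():
        catalog.upsert(DatasetMetadata(urn=urn, schema=schema))

def on_schema_event(catalog: Catalog, event: dict) -> None:
    """Event-driven path: apply a single change within seconds of emission."""
    catalog.upsert(DatasetMetadata(urn=event["urn"], schema=event["schema"]))

if __name__ == "__main__":
    catalog = Catalog()
    # Batch crawl of a (fake) warehouse connector.
    batch_crawl(catalog, lambda: [("warehouse.sales.orders", {"id": "bigint"})])
    # Event-driven update from a (fake) schema registry notification.
    on_schema_event(catalog, {"urn": "lake.events.clicks",
                              "schema": {"user_id": "string", "ts": "timestamp"}})
    print(sorted(catalog.datasets))
```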
Building the Universal Catalog: Raw metadata from diverse sources must be normalized into a canonical model. This model typically includes datasets, fields, pipelines, dashboards, users, and the relationships between them. Modern lakehouse platforms such as Google Dataplex and the AWS lakehouse stack are built around a universal catalog that spans data lakes using open formats like Apache Iceberg, warehouses, and ML platforms.

The processing layer enriches this raw metadata with field-level statistics from profiling runs, including null counts and cardinality. It adds data quality metrics such as success rates and freshness. It applies classification labels for Personally Identifiable Information (PII), financial data, and confidential information using rules and machine learning models. Finally, it assigns ownership and domain information based on organizational data.
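As a rough illustration of the canonical model and the enrichment layer, the sketch below defines hypothetical `CatalogDataset` and `FieldProfile` records carrying profiling statistics, classification labels, ownership, and lineage edges, plus a toy rule-based PII tagger. The field names and classification rule are assumptions, not a real platform's schema.

```python
# Sketch of a canonical catalog entity after enrichment. Field names and the
# classification rule are illustrative, not any specific platform's model.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FieldProfile:
    name: str
    data_type: str
    null_count: Optional[int] = None      # from profiling runs
    cardinality: Optional[int] = None     # distinct-value count
    classification: Optional[str] = None  # e.g. "PII", "FINANCIAL"

@dataclass
class CatalogDataset:
    urn: str                              # canonical identifier
    platform: str                         # "iceberg", "warehouse", "kafka", ...
    fields: List[FieldProfile] = field(default_factory=list)
    owner: Optional[str] = None           # assigned from organizational data
    domain: Optional[str] = None
    freshness_ok: Optional[bool] = None   # data quality signal
    upstream: List[str] = field(default_factory=list)  # lineage edges

PII_HINTS = ("email", "phone", "ssn", "address")

def classify_fields(dataset: CatalogDataset) -> None:
    """Toy rule-based PII tagging; real systems combine rules with ML models."""
    for f in dataset.fields:
        if any(hint in f.name.lower() for hint in PII_HINTS):
            f.classification = "PII"

orders = CatalogDataset(
    urn="warehouse.sales.orders",
    platform="warehouse",
    fields=[FieldProfile("customer_email", "string", null_count=12, cardinality=98000),
            FieldProfile("order_total", "decimal")],
    owner="sales-data-team",
    domain="sales",
)
classify_fields(orders)
print([(f.name, f.classification) for f in orders.fields])
# [('customer_email', 'PII'), ('order_total', None)]
```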
⚠️ Common Pitfall: If metadata ingestion lags or breaks silently, users discover datasets that are deprecated or broken. This erodes trust faster than having no discovery system at all. A common Service Level Objective (SLO) is that critical metadata changes are visible in discovery within a few minutes.
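One way to guard against ingestion lagging or breaking silently is a freshness check against that SLO. The sketch below flags critical entities whose catalog metadata has not been refreshed recently; the five-minute window and record shape are assumptions for illustration.

```python
# Sketch of a freshness check against the SLO described above: flag critical
# entities whose catalog metadata has not been refreshed recently. The
# five-minute threshold and record shape are illustrative assumptions.
from datetime import datetime, timedelta, timezone

SLO_WINDOW = timedelta(minutes=5)

def stale_entities(entities, now=None):
    """Return URNs of critical entities breaching the freshness SLO."""
    now = now or datetime.now(timezone.utc)
    return [e["urn"] for e in entities
            if e["critical"] and now - e["last_metadata_update"] > SLO_WINDOW]

now = datetime.now(timezone.utc)
catalog_entities = [
    {"urn": "warehouse.sales.orders", "critical": True,
     "last_metadata_update": now - timedelta(minutes=2)},
    {"urn": "lake.events.clicks", "critical": True,
     "last_metadata_update": now - timedelta(hours=3)},  # ingestion broke silently
]
print(stale_entities(catalog_entities))  # ['lake.events.clicks']
```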
Storage Requirements: The metadata store needs strong consistency for updates to critical entities, efficient graph traversal for lineage queries, and versioning for audit purposes. Many companies manage fewer than 100 million nodes and edges in their metadata graph, which is tractable with proper indexing in relational databases, graph databases, or key-value stores with secondary indexes.
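A lineage query is essentially a graph traversal over dataset-to-dataset edges. The sketch below shows an upstream lookup as a breadth-first search over a hypothetical adjacency map; a production store would run the equivalent query against a graph database or a relational edge table with recursive queries.

```python
# Sketch of an upstream-lineage query as a breadth-first traversal over an
# adjacency map. The graph below is made up for illustration.
from collections import deque

# dataset -> list of direct upstream datasets
UPSTREAM_EDGES = {
    "dashboard.revenue": ["warehouse.sales.orders_agg"],
    "warehouse.sales.orders_agg": ["warehouse.sales.orders"],
    "warehouse.sales.orders": ["lake.raw.orders"],
}

def upstream_lineage(urn: str) -> list:
    """All transitive upstream dependencies of a dataset, nearest first."""
    seen, order, queue = {urn}, [], deque([urn])
    while queue:
        current = queue.popleft()
        for parent in UPSTREAM_EDGES.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

print(upstream_lineage("dashboard.revenue"))
# ['warehouse.sales.orders_agg', 'warehouse.sales.orders', 'lake.raw.orders']
```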
💡 Key Takeaways
Hybrid ingestion combines batch crawlers for completeness with event-driven updates for low latency, achieving metadata freshness of a few minutes
A universal catalog normalizes metadata from lakes, warehouses, and streaming platforms into a canonical model with datasets, fields, and relationships
Enrichment adds field-level statistics, quality metrics, PII classification, and ownership information on top of raw technical metadata
A common SLO is that metadata updates for critical systems are visible within a few minutes; stale metadata erodes user trust faster than having no discovery at all
📌 Examples
1. When a data engineer creates a new Iceberg table in the lake, a schema registry event triggers immediate ingestion into the catalog, making it searchable within 2 minutes
2. A profiling job runs nightly on high-value datasets, extracting null counts and cardinality for all columns and storing them in the catalog for search filters
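A minimal version of that profiling step might look like the following pandas sketch, which computes per-column null counts and cardinality on sample data; the sample table and the record shape written to the catalog are assumptions.

```python
# Sketch of the profiling step from example 2: compute null counts and
# cardinality per column with pandas and emit records for the catalog.
import pandas as pd

df = pd.DataFrame({
    "customer_email": ["a@x.com", None, "b@x.com", "a@x.com"],
    "order_total": [10.5, 22.0, None, 10.5],
})

profile = [
    {"column": col,
     "null_count": int(df[col].isna().sum()),
     "cardinality": int(df[col].nunique())}
    for col in df.columns
]
print(profile)
# [{'column': 'customer_email', 'null_count': 1, 'cardinality': 2},
#  {'column': 'order_total', 'null_count': 1, 'cardinality': 2}]
```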