How Metadata Flows Through the Data Platform
The Production Flow:
In a large-scale lakehouse environment, metadata is produced and consumed at every stage of the data pipeline. Understanding this flow is critical for interviews because it reveals how catalogs integrate with the entire data stack.
The Active Role of Catalogs:
The catalog is not a passive index; it is an active coordination service that influences how data is stored, transformed, discovered, and secured. When a governance system tags a column as PII, the catalog can automatically trigger masking for every consumer outside a privileged group. These access decisions must be fast, often under 10 ms at p99, because they sit in the hot path for query planning.
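As a rough sketch of what that hot-path check might look like, here is a minimal policy lookup against an in-memory cache. The table, tag, and group names and the check_column_policies helper are illustrative assumptions, not any particular catalog's API.

```python
from dataclasses import dataclass

# Hypothetical in-memory policy cache: column tags and privileged groups are
# preloaded from the catalog so the hot-path check never leaves memory.
COLUMN_TAGS = {("sales.orders", "customer_email"): {"PII"}}
PRIVILEGED_GROUPS = {"pii-readers"}

@dataclass
class ColumnDirective:
    column: str
    action: str  # "allow" or "mask"

def check_column_policies(table: str, columns: list[str], user_groups: set[str]) -> list[ColumnDirective]:
    """Return per-column directives for a query plan; pure dict lookups keep latency tiny."""
    privileged = bool(user_groups & PRIVILEGED_GROUPS)
    directives = []
    for col in columns:
        tags = COLUMN_TAGS.get((table, col), set())
        action = "mask" if ("PII" in tags and not privileged) else "allow"
        directives.append(ColumnDirective(col, action))
    return directives

# The query engine asks before planning:
print(check_column_policies("sales.orders", ["order_id", "customer_email"], {"analysts"}))
# [ColumnDirective(column='order_id', action='allow'), ColumnDirective(column='customer_email', action='mask')]
```

The design point is that the decision path is an in-memory lookup, so the sub-10 ms p99 budget is spent on the network hop to the catalog rather than on computation.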
Real Implementation Examples:
Netflix built Metacat as a federated catalog across Hive, RDS, and other stores, exposing a unified API to thousands of data engineers. LinkedIn created DataHub with push-based ingestion from dozens of systems and a graph backend for lineage traversal over millions of nodes. Databricks Unity Catalog centralizes schema and permission management across workspaces and clouds, integrating transaction logs for strong lineage guarantees. AWS Glue Data Catalog acts as a centralized schema store for S3 data, tightly integrated with Athena and Redshift for zero-copy access.
The key insight is that the catalog sits at the intersection of storage, compute, and governance, making it a critical component of the data platform architecture. The production flow itself breaks down into five stages:
1. Ingestion: Source systems (microservices, operational databases, event streams) emit change data. Ingestion pipelines land raw data into object storage at rates of 5 to 50 terabytes (TB) per day, with p50 ingestion latency under 5 minutes and p99 under 15 minutes.
2. Schema Discovery: Tools like Airbyte or internal connectors infer schemas from incoming data and publish metadata events whenever columns are added or types change. These events flow into a metadata service (a sketch of such an event follows this list).
3. Transaction Logs: Storage layers like Delta Lake or Apache Iceberg write transaction logs that record which files belong to which table version, along with row counts and operation types. This transactional metadata streams into the catalog to build lineage and audit trails (see the listener sketch after this list).
4. Aggregation: The metadata service aggregates, normalizes, and indexes metadata into a searchable catalog. At large companies, this might contain 100,000 to 1 million datasets, 10 million to 100 million columns, and lineage graphs with billions of edges.
5. Consumption: Analysts, ML engineers, and automated services query the catalog to find datasets. They expect sub-100 millisecond (ms) p50 latency and sub-300 ms p99 latency for typical lookups, and throughput can reach thousands of queries per second (QPS) during working hours.
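To make step 2 concrete, here is a minimal sketch of the kind of schema-change event a connector might publish. The event fields, topic name, and publish stand-in are assumptions for illustration, not Airbyte's actual format.

```python
import json
import time

def publish(topic: str, payload: dict) -> None:
    """Stand-in for a real message-bus producer (e.g. a Kafka producer's send call)."""
    print(f"-> {topic}: {json.dumps(payload)}")

# A connector noticed a new column while inferring the schema of an incoming feed.
schema_change_event = {
    "event_type": "SCHEMA_CHANGED",
    "dataset": "raw.orders",
    "changes": [{"op": "ADD_COLUMN", "name": "loyalty_tier", "type": "string"}],
    "source": "orders-ingestion-connector",
    "observed_at": int(time.time() * 1000),
}
publish("metadata.schema-changes", schema_change_event)
```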
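And for step 3, a sketch of a listener that turns one Delta Lake commit file (the newline-delimited JSON under _delta_log/) into a catalog freshness update. The add, commitInfo, and numRecords fields follow the open Delta log format, but the update_catalog call and its payload are hypothetical.

```python
import json
from pathlib import Path

def update_catalog(table: str, payload: dict) -> None:
    """Stand-in for a call to the catalog's metadata-ingestion API."""
    print(f"catalog update for {table}: {payload}")

def process_commit(table: str, log_file: Path) -> None:
    """Parse one Delta commit and refresh the catalog's freshness and audit metadata."""
    added_files, added_rows, operation = 0, 0, None
    for line in log_file.read_text().splitlines():
        action = json.loads(line)
        if "add" in action:
            # Each "add" action registers a data file; its stats carry the row count.
            added_files += 1
            stats = json.loads(action["add"].get("stats", "{}"))
            added_rows += stats.get("numRecords", 0)
        elif "commitInfo" in action:
            operation = action["commitInfo"].get("operation")
    update_catalog(table, {
        "version": int(log_file.stem),   # log files are named by zero-padded version number
        "operation": operation,
        "files_added": added_files,
        "rows_added": added_rows,
    })

# Example usage (assumes the log file exists locally):
# process_commit("sales.orders", Path("_delta_log/00000000000000000042.json"))
```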
Catalog query performance targets: p50 lookup under 100 ms, p99 lookup under 300 ms, peak throughput in the thousands of QPS.
💡 Key Takeaways
✓ Metadata flows through the pipeline as events: ingestion tools publish schema changes, storage layers like Delta Lake emit transaction logs, and the catalog aggregates everything into a unified index
✓ At scale, catalogs handle 100,000 to 1 million datasets with billions of lineage edges, serving thousands of QPS with sub-100 ms p50 and sub-300 ms p99 latency for lookups
✓ The catalog is not passive: it sits in the hot path for query planning and access control, making policy decisions in under 10 ms p99 to avoid blocking queries
✓ Real implementations vary: Netflix Metacat federates across stores, LinkedIn DataHub uses graph backends for lineage, Databricks Unity Catalog integrates transaction logs for strong guarantees
📌 Examples
1. When a Delta Lake table commits a new version, it writes a transaction log entry with operation type, file paths, and row counts. A listener streams this to the catalog within seconds, updating freshness timestamps and lineage graphs.
2. An analyst queries the catalog for "tables containing user email." The catalog searches across 500,000 datasets using an inverted index, returning 47 matches in 68 ms with owners, SLAs, and PII tags (a toy version of this lookup follows these examples).
3. A governance policy tags a column as PII. The catalog propagates this to all query engines. When a non-privileged user queries the table, the engine asks the catalog for policies, receives a "mask column" directive in 8 ms, and applies it before returning results.
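The inverted-index lookup in example 2 can be sketched with a toy version. The dataset entries, field names, and search function below are made up for demonstration; a production catalog would use a real search engine over far larger posting lists.

```python
from collections import defaultdict

# Toy catalog entries; a real catalog would hold hundreds of thousands of these.
DATASETS = {
    "crm.contacts": {"owner": "growth-team", "columns": ["contact_id", "user_email"], "tags": {"user_email": ["PII"]}},
    "sales.orders": {"owner": "sales-eng", "columns": ["order_id", "amount"], "tags": {}},
}

# Build an inverted index: token -> set of dataset names whose columns contain it.
index = defaultdict(set)
for name, meta in DATASETS.items():
    for col in meta["columns"]:
        for token in col.split("_"):
            index[token].add(name)

def search(query: str) -> list[dict]:
    """Intersect per-token posting lists, then hydrate results with owner and PII tags."""
    tokens = query.lower().split()
    hits = set.intersection(*(index.get(t, set()) for t in tokens)) if tokens else set()
    return [{"dataset": h, "owner": DATASETS[h]["owner"], "pii_columns": list(DATASETS[h]["tags"])} for h in sorted(hits)]

print(search("user email"))
# [{'dataset': 'crm.contacts', 'owner': 'growth-team', 'pii_columns': ['user_email']}]
```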