Metadata Catalog Implementation Architecture
The Five-Layer Architecture:
Production-grade metadata catalogs follow a consistent architectural pattern that interviewers expect you to understand. The pattern has five layers: ingestion, normalization, storage, serving, and policy enforcement. Each layer has specific trade-offs and scale challenges.
Layer 1: Event-Driven Ingestion
Metadata ingestion is event-driven, not batch. Connectors in storage systems, schedulers, ETL tools, and BI platforms emit metadata change events whenever datasets, schemas, or jobs change. For example, when a new Delta Lake table version commits, a transaction log listener publishes a message with the table name, version, operation type, and affected files.
At high scale, ingestion pipelines handle thousands to tens of thousands of metadata events per second. This requires careful backpressure handling and idempotency: if the same schema change event arrives twice due to retries, the catalog must deduplicate it. Most implementations use message queues such as Kafka with consumer groups to parallelize ingestion and guarantee at-least-once delivery.
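To make the ingestion idea concrete, here is a minimal sketch of an idempotent Kafka consumer for metadata change events. It assumes a topic named metadata-changes, JSON events carrying a stable event_id, and a hypothetical apply_change() helper; a real deployment would back the dedup set with a durable store rather than process memory.

```python
# Minimal ingestion sketch using kafka-python (topic name and event shape are assumed).
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "metadata-changes",
    bootstrap_servers="localhost:9092",
    group_id="catalog-ingest",      # consumer group parallelizes across partitions
    enable_auto_commit=False,       # commit offsets only after the event is applied
    value_deserializer=lambda raw: json.loads(raw),
)

seen_event_ids = set()  # in production: a durable dedup store (e.g. Redis SETNX)

def apply_change(event: dict) -> None:
    """Hypothetical: normalize the event and upsert it into the catalog store."""
    print("applying", event["entity"], event["operation"], event["version"])

for message in consumer:
    event = message.value
    # At-least-once delivery means retries can redeliver the same event,
    # so deduplicate on a stable event_id to keep the catalog idempotent.
    if event["event_id"] not in seen_event_ids:
        apply_change(event)
        seen_event_ids.add(event["event_id"])
    consumer.commit()  # advance the offset only after successful handling
```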
Layer 2: Normalization and Modeling
Different systems represent the same concepts incompatibly. Hive uses a two-level "database + table" namespace, BigQuery uses "project + dataset + table," and Snowflake uses "database + schema + table," each with its own semantics. The catalog maps these into a unified data model, often a graph where nodes are datasets, columns, jobs, and dashboards, and edges represent lineage or ownership.
Strong typing and versioning are essential. When a column changes from integer to string, the catalog must track both the old and new versions to support time travel queries over metadata. This is critical for debugging: "What did the schema look like when the pipeline broke last Tuesday?"
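A sketch of what a unified, versioned model might look like is below; the DatasetUrn and SchemaVersion names are illustrative, not any particular catalog's API.

```python
# Unified, versioned dataset model (illustrative names, not a specific catalog's API).
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class DatasetUrn:
    platform: str   # "hive", "bigquery", "snowflake", ...
    name: str       # canonical dotted name within that platform
    env: str = "PROD"

def urn_from_bigquery(project: str, dataset: str, table: str) -> DatasetUrn:
    return DatasetUrn("bigquery", f"{project}.{dataset}.{table}")

def urn_from_snowflake(database: str, schema: str, table: str) -> DatasetUrn:
    return DatasetUrn("snowflake", f"{database}.{schema}.{table}".lower())

@dataclass
class SchemaVersion:
    version: int
    fields: dict            # column name -> type
    recorded_at: datetime

@dataclass
class DatasetEntity:
    urn: DatasetUrn
    schema_history: list = field(default_factory=list)

    def record_schema(self, fields: dict) -> None:
        # Append-only history enables time travel queries over metadata.
        self.schema_history.append(
            SchemaVersion(len(self.schema_history) + 1, dict(fields),
                          datetime.now(timezone.utc))
        )

    def schema_as_of(self, ts: datetime) -> Optional[SchemaVersion]:
        # "What did the schema look like when the pipeline broke last Tuesday?"
        candidates = [v for v in self.schema_history if v.recorded_at <= ts]
        return candidates[-1] if candidates else None
```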
Layer 3: Multi-System Storage
Most implementations use a combination of storage systems. A relational database (Postgres, MySQL) stores authoritative metadata entities with ACID (Atomicity, Consistency, Isolation, Durability) guarantees for writes. A search index (Elasticsearch, Solr) powers free-text search over descriptions, tags, and field names, with faceted filtering by owner, domain, or classification. A graph database or graph layer (Neo4j, or a custom graph built on the relational store) efficiently traverses lineage upstream and downstream.
At LinkedIn scale, DataHub supports lineage queries over millions of nodes with p99 latencies under a few hundred milliseconds. This requires careful denormalization: precomputing transitive closures for common lineage patterns and caching hot paths.
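As a sketch of the denormalization idea, the snippet below precomputes the downstream transitive closure of a tiny lineage graph so that "everything downstream of X" becomes a single lookup instead of a live traversal; the edge data is made up for illustration.

```python
# Precompute downstream transitive closures for fast lineage lookups (toy edge data).
from collections import defaultdict, deque

def downstream_closure(edges):
    """edges: (upstream, downstream) pairs -> node: set of all transitive downstreams."""
    adjacency = defaultdict(set)
    for up, down in edges:
        adjacency[up].add(down)

    closure = {}
    for start in list(adjacency):
        seen, queue = set(), deque(adjacency[start])
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            queue.extend(adjacency.get(node, ()))
        closure[start] = seen
    return closure

edges = [
    ("raw.events", "staging.events"),
    ("staging.events", "marts.daily_active_users"),
    ("marts.daily_active_users", "dashboard.growth"),
]

# "Show everything downstream of raw.events" is now a dictionary lookup.
print(downstream_closure(edges)["raw.events"])
# {'staging.events', 'marts.daily_active_users', 'dashboard.growth'}
```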
Layer 4: Serving API and Caching
The serving layer is a stateless API tier and UI. The API handles search, entity fetch by key, lineage traversal, and policy evaluation. Caching is critical: hot entities such as popular tables or dashboards are stored in memory caches (Redis, Memcached) to keep p50 latency under 50 ms even under heavy read loads reaching thousands of QPS.
For write paths, catalogs prioritize durability and ordering, accepting slightly higher latency: 100 to 300 ms p99 for metadata writes is typical. This is acceptable because metadata writes are much rarer than reads.
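A cache-aside read path for the serving tier might look like the sketch below; the key format, TTL, and fetch_entity_from_db() helper are assumptions, not a specific catalog's API.

```python
# Cache-aside entity reads with Redis (key format, TTL, and DB helper are assumed).
import json
import redis

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # hot entities stay warm; writes invalidate explicitly

def fetch_entity_from_db(urn: str) -> dict:
    """Hypothetical authoritative lookup against the relational store."""
    return {"urn": urn, "owner": "data-platform", "tags": ["pii"]}

def get_entity(urn: str) -> dict:
    cached = cache.get(f"entity:{urn}")
    if cached is not None:
        return json.loads(cached)          # cache hit: well under 50 ms
    entity = fetch_entity_from_db(urn)     # cache miss: slower, authoritative path
    cache.set(f"entity:{urn}", json.dumps(entity), ex=CACHE_TTL_SECONDS)
    return entity

def invalidate_entity(urn: str) -> None:
    # Called from the write path after a metadata change commits.
    cache.delete(f"entity:{urn}")
```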
Layer 5: Policy Enforcement Integration
Policy enforcement integrates the catalog with query engines and access control systems. When a user queries a table, the engine asks the catalog: "What policies apply to user X for dataset Y?" The catalog returns rules like "mask column Z" or "deny access to dataset Y." The engine applies these rules.
This must be fast because it is in the hot path for query planning. Catalogs cache policy evaluations and use hierarchical rule structures to avoid scanning all policies for every query. Target latency is under 10 ms p99.
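One way to keep policy checks off the slow path is to cache evaluations keyed by (role, dataset) and fall back through a small rule hierarchy; the rule shapes and names below are illustrative assumptions.

```python
# Cached, hierarchical policy evaluation (rule shapes and names are illustrative).
from functools import lru_cache

# Dataset-level rules are consulted first, then domain-level defaults,
# so evaluation never scans every policy in the system.
DATASET_RULES = {
    ("analyst", "warehouse.users"): ({"action": "mask", "column": "email"},),
}
DOMAIN_RULES = {
    ("analyst", "warehouse"): ({"action": "allow"},),
}

@lru_cache(maxsize=100_000)   # stand-in for a shared, Redis-backed evaluation cache
def evaluate_policy(role: str, dataset: str) -> tuple:
    rules = DATASET_RULES.get((role, dataset))
    if rules is None:
        domain = dataset.split(".")[0]             # fall back to domain defaults
        rules = DOMAIN_RULES.get((role, domain), ({"action": "deny"},))
    return rules

# Query engine call during planning: repeated checks hit the in-process cache.
print(evaluate_policy("analyst", "warehouse.users"))
# ({'action': 'mask', 'column': 'email'},)
```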
Serving Layer Performance Targets
- Cached entity reads: <50 ms p50
- Metadata writes: 100-300 ms p99
- Policy checks: <10 ms p99
❗ Remember: The catalog is not just a passive index. It is an active service that influences query planning, access control, and schema evolution across the entire data platform. Reliability patterns mirror other critical services: multi-region deployment, active-active databases, periodic reconciliation from source systems, and health checks that verify both ingestion freshness and serving latency.
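A basic health probe along these lines could check both signals at once; the thresholds mirror the SLA targets discussed below, and the two probe helpers are hypothetical.

```python
# Health check covering ingestion freshness and serving latency (probes are hypothetical).
import time
from datetime import datetime, timezone

MAX_INGESTION_LAG_SECONDS = 60      # ingestion lag target: p99 under 1 minute
MAX_SERVING_LATENCY_SECONDS = 0.3   # serving latency target: p99 under 300 ms

def latest_ingested_event_time() -> datetime:
    """Hypothetical: timestamp of the most recent event applied to the catalog."""
    return datetime.now(timezone.utc)

def probe_entity_fetch() -> None:
    """Hypothetical: fetch a canary entity through the serving API."""
    time.sleep(0.01)

def health_check() -> dict:
    lag = (datetime.now(timezone.utc) - latest_ingested_event_time()).total_seconds()
    start = time.monotonic()
    probe_entity_fetch()
    latency = time.monotonic() - start
    return {
        "ingestion_fresh": lag <= MAX_INGESTION_LAG_SECONDS,
        "serving_fast": latency <= MAX_SERVING_LATENCY_SECONDS,
        "ingestion_lag_seconds": round(lag, 3),
        "serving_latency_seconds": round(latency, 3),
    }

print(health_check())
```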
The Differentiator:
Treating metadata as a product with SLAs, not as a best-effort index, is the distinguishing mark of robust production-grade catalogs. Teams that do this well measure ingestion lag (target: p99 under 1 minute), serving latency (target: p99 under 300 ms), and freshness (target: metadata accurate within 5 minutes of a source change).
💡 Key Takeaways
✓Production catalogs use five layers: event-driven ingestion (Kafka, 10K+ events/sec), normalization to unified graph models, multi-system storage (relational + search + graph), a cached serving API (p50 <50 ms), and policy enforcement (<10 ms p99)
✓Storage is hybrid by necessity: relational databases for ACID writes, search indexes (Elasticsearch) for free text queries, and graph layers for lineage traversal over millions of nodes with sub-second p99 latencies
✓Caching is critical for serving layer performance: hot entities like popular tables are cached in Redis to handle thousands of QPS with p50 latency under 50 ms, while metadata writes accept 100 to 300 ms p99
✓Policy enforcement sits in the hot path for query planning, requiring sub-10 ms p99 latency through hierarchical rule caching and precomputed policy evaluations to avoid blocking queries
📌 Examples
1. When a Delta Lake table commits a new version, a Kafka consumer ingests the transaction log event (containing files, row counts, operation type) within 200 ms. The normalization layer maps it to the catalog's graph model, updates Postgres for durability and Elasticsearch for searchability, and invalidates Redis cache entries for that table.
2. A lineage query "show all downstream dashboards for table X" hits the graph layer. With 5 million nodes and precomputed transitive closures, it returns 127 downstream assets in 180 ms p99, even during peak hours with 3,000 QPS of catalog load.
3. A query engine plans a query on a PII-tagged table. It calls the catalog's policy API, which checks Redis for cached policy rules, finds a hit, and returns "mask column email" in 4 ms, allowing the query to proceed without blocking on a database lookup.