
How Metadata Flows Through a Catalog System

The Ingestion Challenge: A data catalog must continuously learn about changes across your entire data platform. When someone creates a table, runs a pipeline, or updates a dashboard, the catalog needs to know. At companies like LinkedIn or Netflix, this means processing tens of thousands of metadata events per minute across millions of entities. The ingestion layer supports two patterns. Push-based ingestion receives events in real time: when a scheduled job completes, the orchestration system emits a lineage event describing which datasets were read and written, and the catalog ingests it immediately, updating its graph within seconds. Pull-based ingestion uses periodic crawlers that scan data stores every few hours, discovering new tables and schema changes even when systems don't emit events.
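To make the two paths concrete, here is a minimal sketch assuming a hypothetical in-memory Catalog with an upsert_entity method, a MetadataEvent shape for the push path, and a list_tables callable (for example, a periodic query against the warehouse's information_schema) for the pull path. None of these names come from a specific catalog product.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class MetadataEvent:
    entity_urn: str   # e.g. "urn:table:warehouse.analytics.daily_orders" (illustrative URN scheme)
    event_type: str   # "schema_change", "lineage", "job_run", ...
    payload: dict


class Catalog:
    """In-memory stand-in for the catalog's metadata store."""

    def __init__(self) -> None:
        self.entities: dict[str, dict] = {}

    def upsert_entity(self, urn: str, metadata: dict) -> None:
        self.entities.setdefault(urn, {}).update(metadata)


def push_ingest(catalog: Catalog, events: Iterable[MetadataEvent]) -> None:
    """Push path: apply events as they arrive, typically within seconds."""
    for event in events:
        catalog.upsert_entity(event.entity_urn, event.payload)


def pull_crawl(catalog: Catalog, list_tables: Callable[[], list[dict]]) -> None:
    """Pull path: periodic rescan of the source, catching tables that never
    produced an event. list_tables is a hypothetical crawler helper returning
    dicts with schema, name, and columns keys."""
    for table in list_tables():
        urn = f"urn:table:{table['schema']}.{table['name']}"
        catalog.upsert_entity(urn, {"columns": table["columns"]})
```

In practice the push path would sit behind a message queue and the pull path behind scheduled crawler jobs; the point is only that both converge on the same upsert operation against the metadata store.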
1. Discovery phase: A pipeline writes a new table to the warehouse. Within 30 to 120 seconds, either an event arrives or a crawler detects it. The catalog extracts the schema, infers the owner from job metadata, and creates an entity record.
2. Lineage capture: Every job run generates lineage events. A single job might read from 5 input tables and write to 2 output tables. At LinkedIn scale, with hundreds of millions of lineage edges, the catalog ingests thousands of these updates per second (a minimal edge-expansion sketch follows this list).
3. Indexing: Background workers denormalize metadata and build search indexes. They compute derived fields like popularity scores (based on query counts) and trust levels (based on certification and quality checks).
4. Query serving: When an analyst searches, the request fans out to the search indexes. The service ranks results using freshness, usage, and ownership signals, returning top matches in under 300 ms at p95 latency (an illustrative ranking sketch also follows this list).
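As referenced in step 2, this sketch shows one way a single run-level lineage event could be expanded into graph edges: one edge per dataset read and one per dataset written. The LineageEvent shape and the URN strings are illustrative assumptions, not a specific catalog's event schema.

```python
from dataclasses import dataclass


@dataclass
class LineageEvent:
    job_run: str        # identifier of the job run that emitted the event
    inputs: list[str]   # dataset URNs read by the run
    outputs: list[str]  # dataset URNs written by the run


def to_edges(event: LineageEvent) -> set[tuple[str, str]]:
    """Expand one run event into dataset-to-job and job-to-dataset edges."""
    edges = {(src, event.job_run) for src in event.inputs}    # one edge per read
    edges |= {(event.job_run, dst) for dst in event.outputs}  # one edge per write
    return edges


event = LineageEvent(
    job_run="daily_orders_rollup",  # hypothetical job name
    inputs=[f"urn:table:raw.orders_part_{i}" for i in range(5)],
    outputs=["urn:table:analytics.daily_orders",
             "urn:table:analytics.order_stats"],
)
assert len(to_edges(event)) == 7  # 5 reads + 2 writes, as in the 3:00 PM example below
```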
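And as referenced in step 4, this sketch shows how derived fields and ranking signals might combine at query time. The weights, the log-scaled popularity score, the freshness decay, and the field names are illustrative assumptions rather than a documented production formula.

```python
import math
import time


def popularity_score(query_count_30d: int) -> float:
    # Log scale so a heavily queried table doesn't drown out everything else.
    return math.log1p(query_count_30d)


def trust_level(certified: bool, checks_passing: int, checks_total: int) -> float:
    # Certification plus the fraction of passing quality checks.
    score = 1.0 if certified else 0.0
    if checks_total:
        score += checks_passing / checks_total
    return score


def rank(entity: dict) -> float:
    """Combine text relevance with freshness, usage, ownership, and trust signals."""
    age_days = (time.time() - entity["last_updated_ts"]) / 86_400
    freshness = 1.0 / (1.0 + age_days)  # decays as the entity goes stale
    has_owner = 1.0 if entity.get("owner") else 0.0
    return (
        2.0 * entity["text_match"]                     # relevance from the search index
        + 1.0 * popularity_score(entity["query_count_30d"])
        + 1.5 * trust_level(entity["certified"],
                            entity["checks_passing"],
                            entity["checks_total"])
        + 1.0 * freshness
        + 0.5 * has_owner
    )


example = {
    "text_match": 0.8, "query_count_30d": 4200, "certified": True,
    "checks_passing": 9, "checks_total": 10,
    "last_updated_ts": time.time() - 3_600, "owner": "analytics-team",
}
print(round(rank(example), 2))
```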
Consistency Model: The catalog trades strong consistency for scalability. Critical user edits, such as updating a table description or changing ownership, use read-after-write consistency: the system writes through both the metadata store and the search index in a single operation, so your change appears immediately. Most other updates are eventually consistent within 1 to 5 minutes. When a job runs at 3:00 PM and updates lineage, the lineage graph might not reflect that change until 3:02 PM. For discovery and documentation this delay is acceptable, and it lets the system buffer events, batch updates, and handle much higher throughput.
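A minimal sketch of the two consistency paths, using in-memory stand-ins for the metadata store, search index, and update queue: user edits write through both synchronously, while lineage updates are only enqueued for background workers to apply later.

```python
from queue import Queue

# In-memory stand-ins for the real metadata store, search index, and event queue.
metadata_store: dict[str, dict] = {}
search_index: dict[str, dict] = {}
update_queue: Queue = Queue()


def update_description(urn: str, description: str) -> None:
    """Critical user edit: read-after-write. Both writes complete before
    returning, so the next page load sees the change immediately."""
    metadata_store.setdefault(urn, {})["description"] = description
    search_index.setdefault(urn, {})["description"] = description


def record_lineage(event: dict) -> None:
    """Eventually consistent path: just enqueue. Background workers batch-apply
    these updates, typically within 1 to 5 minutes."""
    update_queue.put(event)
```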
⚠️ Common Pitfall: Assuming all metadata needs real-time updates. Sub-second ingestion is expensive and complex, and most use cases work fine with minute-level delays, which dramatically simplifies the architecture and reduces cost.
Scale Numbers: At companies like Uber and Airbnb, a production catalog handles 10,000 to 100,000 metadata mutations per minute. The ingestion pipeline needs buffering with message queues, retry logic for failed events, and idempotent processing, since events may arrive more than once. Background reconciliation jobs regularly rescan sources, catching any missed events. This two-layer approach (fast event ingestion plus periodic reconciliation) keeps metadata accurate even when individual events are lost.
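A sketch of the idempotency and reconciliation pieces, assuming each event carries a stable event_id and that a hypothetical rescan_source callable can list the source system's current state; neither is a specific product's API.

```python
from typing import Callable, Iterable

processed_event_ids: set[str] = set()


def handle_event(event: dict, apply: Callable[[dict], None]) -> None:
    """Idempotent processing: retries and redelivery mean the same event can
    arrive more than once, so duplicates are skipped by event_id."""
    if event["event_id"] in processed_event_ids:
        return
    apply(event)
    processed_event_ids.add(event["event_id"])


def reconcile(catalog: dict[str, dict],
              rescan_source: Callable[[], Iterable[tuple[str, dict]]]) -> None:
    """Periodic reconciliation: rescan the source and patch anything the
    event stream missed or dropped."""
    for urn, metadata in rescan_source():
        if catalog.get(urn) != metadata:
            catalog[urn] = metadata
```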
💡 Key Takeaways
Push-based ingestion captures events in real time, while pull-based crawling discovers changes that systems never emit, creating redundant coverage
At LinkedIn scale, ingestion handles thousands of lineage updates per second across hundreds of millions of edges while keeping query latency under 300 ms at p95
Critical user edits use read-after-write consistency for immediate visibility, while background metadata updates are eventually consistent within 1 to 5 minutes
Buffering with message queues, retry logic, and idempotent processing handle event delivery failures and duplicate messages
Periodic reconciliation jobs rescan sources to catch missed events, ensuring metadata accuracy despite individual event losses
📌 Examples
1. A data engineer updates a table description at 2:00 PM. The write goes to both the metadata store and the search index atomically, so when they refresh the page at 2:00:05 PM, the change is visible immediately.
2. A pipeline runs at 3:00 PM, reading from 5 tables and writing to 2. The orchestrator emits 7 lineage relationships, which are buffered in a queue, processed in batches, and visible in the lineage graph by 3:02 PM.
3. 500 new tables are created in the warehouse overnight. The morning crawler run at 8:00 AM discovers them all, even though no events were emitted, and by 8:30 AM every one is searchable in the catalog.