Data Governance & Lineage • Data Lineage Tracking & Visualization
Production Lineage at FAANG Scale
Scale Realities:
At Meta, Google, or similar scale, you're managing lineage for 10,000+ datasets, 5,000+ scheduled jobs, hundreds of BI dashboards, and thousands of ML models. The lineage graph contains tens of millions of links. A single heavily used reference table might have thousands of downstream dependencies. This creates unique operational challenges.
The Architecture Pattern:
Most large lineage platforms follow a consistent design. Data processing engines emit lineage events via a message bus or service API. Events describe a process (job definition), a run (execution with timestamp and status), and dataset operations (reads and writes with fully qualified names like project.dataset.table).
These events flow into a central lineage service that validates, deduplicates, and enriches them. The service persists data into a graph database optimized for online traversal queries, backed by columnar storage for longer-term analytics. Query latency targets are typically sub-second at p50 and under a few seconds at p99 for multi-hop traversals.
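A minimal sketch of what one such event might look like, loosely modeled on the open-source OpenLineage format. The job name, dataset names, and choice of fields are illustrative, not what any particular engine actually emits:

```python
import json
import uuid
from datetime import datetime, timezone

# Illustrative lineage event: a job (process), a run (execution), and the
# datasets it read and wrote, identified by fully qualified names.
# All identifiers below are hypothetical.
event = {
    "eventType": "COMPLETE",                      # the run finished successfully
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},          # unique per execution
    "job": {"namespace": "batch-etl", "name": "daily_user_rollup"},
    "inputs": [
        {"namespace": "bigquery", "name": "analytics-prod.raw.events"},
        {"namespace": "bigquery", "name": "analytics-prod.dim.users"},
    ],
    "outputs": [
        {"namespace": "bigquery", "name": "analytics-prod.agg.daily_users"},
    ],
}

# In production this would be published to a message bus or posted to the
# lineage service's ingest API; printing stands in for that here.
print(json.dumps(event, indent=2))
```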
Ingestion and Freshness:
Lineage is only useful if it's current. Systems monitor ingestion lag, ensuring events appear in the graph within minutes of job completion, not hours. Cloud warehouses like BigQuery automatically emit lineage for query, load, and copy jobs. Spark and Flink integrations require instrumentation but can achieve similar freshness.
At steady state, a large platform might ingest 100,000 lineage events per hour during peak. The service must handle bursts, deduplicate identical events from retries, and backfill history when integrating a new engine.
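One way to handle duplicates from retries is to derive an idempotency key from the fields that define an event's identity and drop anything already seen. A rough sketch under that assumption; the key fields and the in-memory seen-set are simplifications of what a real service would use (typically a keyed store with a TTL):

```python
import hashlib
import json

def dedup_key(event: dict) -> str:
    """Stable key built from the fields that identify a lineage event.
    Retries re-emit the same run, job, inputs, and outputs, so they
    collapse to the same key."""
    identity = {
        "runId": event["run"]["runId"],
        "job": event["job"],
        "inputs": sorted(d["name"] for d in event["inputs"]),
        "outputs": sorted(d["name"] for d in event["outputs"]),
        "eventType": event["eventType"],
    }
    blob = json.dumps(identity, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

seen: set[str] = set()  # in production: a keyed store with a TTL, not process memory

def ingest(event: dict) -> bool:
    """Return True if the event was new and accepted, False if it was a duplicate."""
    key = dedup_key(event)
    if key in seen:
        return False
    seen.add(key)
    # ...validate, enrich, and write to the graph store here...
    return True
```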
Retention Strategy:
Keeping full-fidelity lineage forever is expensive. A common pattern: keep 30 days of hot, fully traversable lineage in fast graph storage. Downsample or aggregate older entries, preserving only critical links or governance-relevant flows. Archive the rest in compressed columnar format for compliance queries.
This balances cost with forensic capability. If an ML model trained 6 months ago needs an audit, you can reconstruct its lineage from archives, even if interactive traversal isn't instant.
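A sketch of how that tiering decision might be encoded. The 30-day hot window comes from the pattern above; the tier names, the governance flag, and the helper function are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone
from enum import Enum

class Tier(Enum):
    HOT = "hot"                  # fully traversable in the fast graph store
    DOWNSAMPLED = "downsampled"  # only critical / governance-relevant links kept
    ARCHIVE = "archive"          # compressed columnar files for compliance queries

HOT_WINDOW = timedelta(days=30)  # hot retention window from the pattern above

def tier_for(edge_created_at: datetime, governance_relevant: bool) -> Tier:
    """Decide where a lineage edge should live based on age and importance."""
    age = datetime.now(timezone.utc) - edge_created_at
    if age <= HOT_WINDOW:
        return Tier.HOT
    if governance_relevant:
        return Tier.DOWNSAMPLED  # critical links stay traversable after 30 days
    return Tier.ARCHIVE          # everything else moves to cheap columnar storage
```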
Graph Explosion Problem:
A central reference dataset used by thousands of jobs creates a hub node in your graph. Traversing from that node can hit tens of thousands of edges. Without limits, visualization tools would choke trying to render this in a browser.
Production systems set hard limits. Google Dataplex, for example, restricts traversals to 20 hops and 10,000 graph links in a single query. This keeps p99 latency under a few hundred milliseconds. When you hit these limits, the UI truncates or collapses sections of the graph, showing aggregate counts instead of individual edges.
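A rough sketch of a bounded traversal with truncation. The 20-hop and 10,000-link caps mirror the Dataplex limits above; the adjacency-dict graph representation and the result shape are assumptions for illustration:

```python
from collections import deque

MAX_HOPS = 20
MAX_LINKS = 10_000

def bounded_downstream(graph: dict[str, list[str]], start: str) -> dict:
    """Breadth-first downstream traversal that stops at the hop and link caps.

    Returns the edges it visited plus a count of edges it skipped (a lower
    bound, since skipped nodes are not expanded), so a UI can show
    'Plus N more' instead of trying to render an entire hub node.
    """
    edges: list[tuple[str, str]] = []
    truncated = 0
    visited = {start}
    queue = deque([(start, 0)])

    while queue:
        node, depth = queue.popleft()
        if depth >= MAX_HOPS:
            continue  # hop limit reached along this path
        for child in graph.get(node, []):
            if len(edges) >= MAX_LINKS:
                truncated += 1  # count the overflow, but do not expand it
                continue
            edges.append((node, child))
            if child not in visited:
                visited.add(child)
                queue.append((child, depth + 1))

    return {"edges": edges, "truncated_edges": truncated}

# Toy usage: a hub table feeding many downstream jobs (names are hypothetical).
toy_graph = {"dim.users": [f"job_{i}" for i in range(30)]}
result = bounded_downstream(toy_graph, "dim.users")
print(len(result["edges"]), "edges,", result["truncated_edges"], "truncated")
```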
Traversal Limits: 20 max hops • 10,000 max links • 200 ms p99 latency
💡 Key Takeaways
✓ Production lineage platforms at Meta or Google scale manage tens of millions of links across 10,000+ datasets and 5,000+ jobs
✓ Graph traversal is limited to 20 hops and 10,000 links per query to keep p99 latency under a few hundred milliseconds and prevent browser rendering failures
✓ Ingestion lag must stay under minutes (not hours) to make lineage useful for incident response and debugging
✓ Hot lineage is kept for 30 days in fast graph storage, then downsampled or archived in compressed columnar format to balance forensic capability with cost
✓ At peak, large platforms ingest 100,000 lineage events per hour and must handle bursts, deduplication, and backfill when integrating new engines
📌 Examples
1. A batch job at Meta processes 2 TB of data every 15 minutes, emitting lineage events that appear in the graph within 2 minutes, allowing real-time impact analysis.
2. A reference dimension table used by 2,000 downstream jobs creates a hub node. Querying its full downstream graph would return 50,000+ edges, so the UI limits the result to 10,000 and shows 'Plus 40,000 more' with filtering options.
3. BigQuery automatically emits lineage for standard query and load jobs. When you run a query joining 5 tables and writing to 1 output table, lineage events flow to the central service within seconds.
4. An ML team needs to audit a model trained 8 months ago. The system reconstructs lineage from compressed archives, showing all 47 source tables and 12 transformation steps, though retrieval takes about 30 seconds rather than being instant.