
What is Data Lineage Tracking?

Definition
Data lineage is a graph-based system that tracks where data came from, how it was transformed, when it changed, and where it flows next across your data platform.
The Core Problem: Imagine you work at a company with thousands of data tables spread across multiple teams and systems. A critical dashboard suddenly shows wrong numbers. Where did the bad data come from? Which upstream tables are affected? Which downstream reports are now broken? Without lineage, you're looking at a black box.

The problem becomes severe at scale. A single dashboard at Amazon might depend on hundreds of upstream tables and jobs spread across different clouds and regions. When that dashboard breaks at 3 AM, manual investigation can take hours or days.

How Lineage Solves This: Data lineage treats your data platform as a dependency graph with three core entities. First, processes represent job definitions or pipeline steps. Second, runs capture individual executions with timestamps and status. Third, events record read and write operations on datasets.

Each data processing job emits lineage events. When your batch job runs every 15 minutes processing 2 TB of clickstream data, it registers which datasets it read from and wrote to. Over time, these events build a complete graph connecting all your data flows.

Real-World Impact: At Meta scale, this means tracking 10,000+ datasets, 5,000 scheduled jobs, and hundreds of dashboards. When a pipeline fails at 03:05 UTC, engineers can immediately see which downstream reports are now stale. When compliance asks where customer email addresses flow, you can trace from the raw source through every transformation to final storage. Instead of hours of detective work, you get answers in seconds.
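To make the three-entity model concrete, here is a minimal sketch in Python of how a job might emit lineage events and how dataset-to-dataset links fall out of them. The class names, the in-memory store, and the run ID format are illustrative assumptions, not any specific lineage product's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class LineageEvent:
    """A single read or write of a dataset by one run of a process."""
    run_id: str
    dataset: str
    operation: str  # "read" or "write"
    at: datetime

@dataclass
class LineageStore:
    """In-memory lineage graph built from raw read/write events."""
    events: list = field(default_factory=list)

    def record(self, run_id: str, reads: list, writes: list) -> None:
        """Called by a job at the end of a run to register its inputs and outputs."""
        now = datetime.now(timezone.utc)
        for ds in reads:
            self.events.append(LineageEvent(run_id, ds, "read", now))
        for ds in writes:
            self.events.append(LineageEvent(run_id, ds, "write", now))

    def edges(self):
        """Derive dataset-to-dataset links: every read feeds every write of the same run."""
        reads, writes = {}, {}
        for e in self.events:
            bucket = reads if e.operation == "read" else writes
            bucket.setdefault(e.run_id, []).append(e.dataset)
        for run_id, sources in reads.items():
            for src in sources:
                for dst in writes.get(run_id, []):
                    yield (src, dst)

# The 15-minute clickstream job from the text, as a lineage emitter:
store = LineageStore()
store.record(run_id="sessionize-2024-01-01T03:15",
             reads=["raw.events"], writes=["processed.user_sessions"])
print(list(store.edges()))  # [('raw.events', 'processed.user_sessions')]
```

The key design point is that jobs only report what they read and wrote per run; the graph of dependencies is derived afterward, which is why lineage stays accurate even as pipelines are rewritten.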
💡 Key Takeaways
- Lineage captures three entities: processes (job definitions), runs (executions with timestamps), and events (read/write operations on datasets)
- At large organizations, lineage graphs can contain tens of millions of links connecting thousands of datasets across multiple systems and clouds
- Primary use cases include impact analysis before schema changes, incident debugging to find affected downstream systems, and compliance tracking for regulated data
- Without lineage, finding the root cause of bad data in a complex platform can take hours or days of manual investigation
- Lineage systems must enforce scale limits, such as restricting graph traversals to 20 hops and 10,000 links, to keep query latency under a few hundred milliseconds (see the traversal sketch after this list)
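The hop and link caps in the last takeaway can be enforced with a bounded breadth-first traversal. This is a sketch assuming a simple adjacency-list graph; the 20-hop and 10,000-link defaults are taken from the text, not from any particular vendor.

```python
from collections import deque

def bounded_downstream(graph: dict, start: str,
                       max_hops: int = 20, max_links: int = 10_000) -> set:
    """Breadth-first walk of downstream datasets, cut off at max_hops levels
    and max_links traversed edges so queries stay fast on huge graphs."""
    seen = {start}
    frontier = deque([(start, 0)])
    links_walked = 0
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_hops:
            continue  # hop limit reached: do not expand this node further
        for neighbor in graph.get(node, ()):
            links_walked += 1
            if links_walked > max_links:
                return seen - {start}  # truncated result: callers should flag it
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

# Tiny example: a raw table feeds a fact table feeding two dashboards.
graph = {
    "raw.events": ["processed.user_sessions"],
    "processed.user_sessions": ["dash.engagement", "dash.retention"],
}
print(bounded_downstream(graph, "raw.events"))
# {'processed.user_sessions', 'dash.engagement', 'dash.retention'}
```

Returning a truncated-but-usable result when the link budget runs out, rather than erroring, is what keeps worst-case query latency bounded on graphs with tens of millions of links.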
📌 Examples
1. A batch job processes 2 TB of clickstream data every 15 minutes. It emits lineage events recording that it read from raw.events at 03:15 UTC and wrote to processed.user_sessions, creating a dependency link in the graph.
2. When a critical fact table changes schema, engineers query lineage to discover it has 150 downstream dependencies, including dashboards, ML models, and other tables. This prevents breaking production systems.
3. During a production incident at 03:05 UTC, SREs use lineage to trace which downstream reports are stale. Instead of manually checking hundreds of jobs, they get a visual graph showing 47 affected datasets within seconds.
4. A compliance team needs to verify that EU customer data never flows to US-only systems. They query column-level lineage for the user.region field and trace every downstream table and process that touches it (see the sketch after this list).
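Example 4 amounts to a reachability query over column-level edges. Here is a minimal sketch, assuming lineage is stored as source-column-to-target-column pairs; the edge data and the naming convention used to spot US-only systems are hypothetical.

```python
def trace_column(edges: dict, column: str) -> set:
    """All columns reachable downstream from `column` (depth-first)."""
    reached, stack = set(), [column]
    while stack:
        col = stack.pop()
        for target in edges.get(col, ()):
            if target not in reached:
                reached.add(target)
                stack.append(target)
    return reached

# Hypothetical column-level edges for the user.region field from example 4.
edges = {
    "user.region": ["analytics.sessions.region", "marketing.audience.region"],
    "analytics.sessions.region": ["reports.eu_kpi.region"],
}
us_only_prefixes = ("us_", "marketing.")  # assumed naming convention for US-only systems
violations = [c for c in trace_column(edges, "user.region")
              if c.startswith(us_only_prefixes)]
print(violations)  # ['marketing.audience.region'] -> flows to a US-only system
```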