Data Lineage Tracking & Visualization
What is Data Lineage Tracking?
Definition
Data lineage is a graph-based system that tracks where data came from, how it was transformed, when it changed, and where it flows next across your data platform.
A lineage system captures three core entities. First, processes represent job definitions or pipeline steps. Second, runs capture individual executions with timestamps and status. Third, events record read and write operations on datasets.
Each data processing job emits lineage events. When your batch job runs every 15 minutes, processing 2 TB of clickstream data, it registers which datasets it read from and wrote to. Over time, these events build a complete graph connecting all your data flows.
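As a concrete sketch, here is what emitting such an event might look like. The `LineageEvent` class and `emit` function are hypothetical stand-ins for a real lineage backend (real systems such as OpenLineage use a similar process/run/inputs/outputs shape):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical event shape mirroring the three entities above:
# a process (job definition), a run (one execution), and the
# read/write events that connect datasets.
@dataclass
class LineageEvent:
    process: str                                      # job or pipeline step name
    run_id: str                                       # identifies this execution
    event_time: datetime                              # when the run happened
    inputs: list[str] = field(default_factory=list)   # datasets read
    outputs: list[str] = field(default_factory=list)  # datasets written

def emit(event: LineageEvent) -> None:
    # A real implementation would send this to a lineage service;
    # here we just print the graph edges the event would create.
    for src in event.inputs:
        for dst in event.outputs:
            print(f"{src} -> {dst} (via {event.process}, run {event.run_id})")

# The 15-minute clickstream job from the text, as one emitted event.
emit(LineageEvent(
    process="clickstream_sessionizer",
    run_id="2024-01-01T03:15:00Z",
    event_time=datetime.now(timezone.utc),
    inputs=["raw.events"],
    outputs=["processed.user_sessions"],
))
```

Every run appends edges like these; the union of all edges accumulated over time is the lineage graph.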
Real-World Impact:
At Meta scale, this means tracking 10,000+ datasets, 5,000 scheduled jobs, and hundreds of dashboards. When a pipeline fails at 03:05 UTC, engineers can immediately see which downstream reports are now stale. When compliance asks where customer email addresses flow, you can trace from the raw source through every transformation to final storage. Instead of hours of detective work, you get answers in seconds.

💡 Key Takeaways
✓ Lineage captures three entities: processes (job definitions), runs (executions with timestamps), and events (read/write operations on datasets)
✓ At large organizations, lineage graphs can contain tens of millions of links connecting thousands of datasets across multiple systems and clouds
✓ Primary use cases include impact analysis before schema changes, incident debugging to find affected downstream systems, and compliance tracking for regulated data
✓ Without lineage, finding the root cause of bad data in a complex platform can take hours or days of manual investigation
✓ Lineage systems must enforce scale limits, such as capping graph traversals at 20 hops and 10,000 links to keep query latency under a few hundred milliseconds (a bounded traversal is sketched below)
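A minimal sketch of such a bounded traversal, assuming an in-memory adjacency-list graph (a production system would query a graph store, but the hop and link caps work the same way):

```python
from collections import deque

def downstream(graph: dict[str, list[str]], start: str,
               max_hops: int = 20, max_links: int = 10_000) -> set[str]:
    """Breadth-first downstream traversal with the scale limits above:
    stop expanding past max_hops, and bail out after max_links edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    links_examined = 0
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_hops:
            continue  # hop cap: do not expand further from here
        for child in graph.get(node, []):
            links_examined += 1
            if links_examined > max_links:
                return seen - {start}  # link cap: return truncated result
            if child not in seen:
                seen.add(child)
                frontier.append((child, depth + 1))
    return seen - {start}

# Hypothetical mini-graph: what goes stale if raw.events is bad?
graph = {
    "raw.events": ["processed.user_sessions"],
    "processed.user_sessions": ["dash.daily_active", "ml.churn_features"],
}
print(downstream(graph, "raw.events"))
# -> {'processed.user_sessions', 'dash.daily_active', 'ml.churn_features'}
```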
📌 Examples
1. A batch job processes 2 TB of clickstream data every 15 minutes. It emits lineage events recording that it read from `raw.events` at 03:15 UTC and wrote to `processed.user_sessions`, creating a dependency link in the graph.
2. When a critical fact table changes schema, engineers query lineage to discover it has 150 downstream dependencies, including dashboards, ML models, and other tables. This prevents breaking production systems.
3. During a production incident at 03:05 UTC, SREs use lineage to trace which downstream reports are stale. Instead of manually checking hundreds of jobs, they get a visual graph showing 47 affected datasets within seconds.
4. A compliance team needs to verify that EU customer data never flows to US-only systems. They query column-level lineage for the `user.region` field and trace every downstream table and process that touches it (sketched below).
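Column-level lineage is the same traversal applied to (table, column) pairs instead of whole datasets. A minimal sketch, with hypothetical table names standing in for a real compliance graph:

```python
# Hypothetical column-level edges: (table, column) -> columns it feeds.
column_graph = {
    ("user", "region"): [("eu.exports", "region"), ("analytics.sessions", "geo")],
    ("analytics.sessions", "geo"): [("us_reports.traffic", "geo")],
}

def trace_column(graph, start):
    # Depth-first walk collecting every downstream (table, column) pair.
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# Every downstream column touched by user.region; a compliance check
# would then flag any destination known to be a US-only system.
for table, column in trace_column(column_graph, ("user", "region")):
    print(f"{table}.{column}")
```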