Data Lineage: Tracking Transformations and Dependencies
What Data Lineage Captures
Data lineage captures where data came from and every transformation applied to it. The result is a directed graph in which nodes are datasets, columns, features, or models, and edges represent operations such as joins, aggregations, or training runs. This graph connects data versions, code versions, environment specifications, and model artifacts into a complete audit trail.
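The graph described above can be sketched minimally as typed nodes and operation-labeled edges. This is an illustrative data model, not any particular lineage system's API; the node kinds, dataset names, and `upstream` helper are assumptions for the example.

```python
# Minimal sketch of a lineage graph: nodes are datasets/columns/features/
# models, edges are (source, operation, destination) triples.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    kind: str   # "dataset", "column", "feature", or "model"
    name: str


@dataclass
class LineageGraph:
    # Denormalized edge list: (src, op, dst) triples
    edges: list = field(default_factory=list)

    def add_edge(self, src: Node, op: str, dst: Node) -> None:
        self.edges.append((src, op, dst))

    def upstream(self, node: Node) -> set:
        """Direct parents of a node: where its data came from."""
        return {src for src, _, dst in self.edges if dst == node}


# Hypothetical pipeline: two source tables joined into a feature table,
# which is then used to train a model.
g = LineageGraph()
orders = Node("dataset", "orders")
users = Node("dataset", "users")
features = Node("dataset", "user_order_features")
model = Node("model", "churn_model_v3")

g.add_edge(orders, "join", features)
g.add_edge(users, "join", features)
g.add_edge(features, "train", model)
```

Walking `upstream` transitively from the model node recovers the full audit trail back to the raw tables.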
Lineage Granularity Trade-offs
Lineage operates at different granularities with corresponding cost trade-offs. Table-level lineage tracks dependencies between whole datasets; it adds minimal overhead and scales to millions of nodes. Column-level lineage maps which output columns depend on which input columns; it is essential for safe schema evolution and adds moderate metadata overhead. Row-level lineage traces individual output rows back to the specific input rows that produced them; it inflates storage 5 to 20 times and roughly doubles CPU cost, so it is reserved for regulated workloads and sensitive PII transformations.
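Column-level lineage can be sketched as a mapping from each output column to the set of input columns it depends on; a schema-evolution check then inverts that mapping. The column names and the `affected_outputs` helper are hypothetical, chosen only to illustrate the idea.

```python
# Column-level lineage as a map: output column -> input columns it reads.
column_lineage = {
    "user_order_features.total_spend": {"orders.amount", "orders.user_id"},
    "user_order_features.order_count": {"orders.user_id"},
    "user_order_features.signup_age":  {"users.signup_date"},
}


def affected_outputs(changed_column: str, lineage: dict) -> set:
    """Which output columns break if an input column is dropped or retyped?"""
    return {out for out, inputs in lineage.items() if changed_column in inputs}


# Retyping orders.amount only threatens the columns that read it:
at_risk = affected_outputs("orders.amount", column_lineage)
```

This inverted lookup is exactly the schema-evolution safety check the paragraph describes: a proposed change to an input column is rejected or flagged if `affected_outputs` is non-empty.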
Implementation Architecture
Production lineage systems store raw lineage events as an append-only log and build derived graph indexes with denormalized adjacency lists on top. At scale, lineage graphs reach millions of nodes and tens of millions of edges. Keeping common queries under 500 milliseconds requires precomputing two- to three-hop neighborhoods, caching hot paths, and periodically compacting the graph to collapse identical edges across daily partitions.
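The k-hop precomputation mentioned above can be sketched as a depth-limited BFS over the adjacency list, run offline for every node and served from a cache. The dataset names, `k_hop_downstream` function, and the cache shape are assumptions for illustration, not a specific system's implementation.

```python
# Sketch: precompute bounded-depth downstream neighborhoods from a
# denormalized adjacency list so common lineage queries avoid live traversal.
from collections import deque

# Hypothetical derived index: node -> direct downstream nodes.
adjacency = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["features.user_spend", "reports.daily_revenue"],
    "features.user_spend": ["models.churn_v3"],
}


def k_hop_downstream(start: str, k: int, adj: dict) -> set:
    """All nodes reachable from `start` in at most k hops (BFS, depth-limited)."""
    seen, frontier = set(), deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand past the hop budget
        for child in adj.get(node, []):
            if child not in seen:
                seen.add(child)
                frontier.append((child, depth + 1))
    return seen


# Precompute two-hop neighborhoods for every node; queries become dict lookups.
two_hop_cache = {n: k_hop_downstream(n, 2, adjacency) for n in adjacency}
```

Serving from `two_hop_cache` turns the common "what sits within two hops of this table" query into a single lookup, which is how the sub-500-millisecond latency budget is met without traversing the full graph per request.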
The Payoff
The payoff is threefold: impact analysis that answers what breaks if an upstream schema changes, reproducibility for any historical experiment, and root-cause analysis when model accuracy suddenly drops.
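Of these, impact analysis can be sketched as the transitive downstream closure of the changed node: everything in the closure is potentially broken. The node names and the `impacted` helper are illustrative assumptions.

```python
# Impact analysis as transitive downstream closure over an adjacency list.
def impacted(start: str, adj: dict) -> set:
    """Every node transitively downstream of `start` (may break if it changes)."""
    seen, stack = set(), [start]
    while stack:
        for child in adj.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Hypothetical graph: a raw table feeds a feature, which feeds a model.
adj = {
    "raw.users": ["features.signup_age"],
    "features.signup_age": ["models.churn_v3"],
}

# A schema change to raw.users puts both downstream nodes at risk.
blast_radius = impacted("raw.users", adj)
```

Root-cause analysis is the same traversal run in the opposite direction: walk upstream from the degraded model to find which input changed.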