
Data Lineage: Tracking Transformations and Dependencies

Data lineage captures where data came from and every transformation applied to it, forming a directed graph whose nodes are datasets, columns, features, or models and whose edges represent operations such as joins, aggregations, or training runs. This graph ties data versions, code versions, environment specifications, and model artifacts into a complete audit trail.

Lineage operates at different granularities with corresponding cost trade-offs. Table-level lineage tracks dependencies between whole datasets, adds minimal overhead, and scales to millions of nodes. Column-level lineage maps which output columns depend on which input columns; it is essential for safe schema evolution and adds moderate metadata overhead. Row-level lineage traces individual output rows back to specific input rows, inflating storage 5 to 20 times and doubling CPU costs, so it is reserved for regulated workloads and transformations over sensitive personally identifiable information (PII).

Production lineage systems store raw events as an append-only log and build derived graph indexes with denormalized adjacency lists. At scale, lineage graphs reach millions of nodes and tens of millions of edges; keeping common queries under 500 milliseconds requires precomputing two- to three-hop neighborhoods, caching hot paths, and periodically compacting identical edges across daily partitions.

The payoff is impact analysis (what breaks if an upstream schema changes), reproducibility for any historical experiment, and root-cause analysis when model accuracy suddenly drops.
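The event-log-plus-derived-index design can be sketched in a few lines. This is a minimal illustration, not any particular system's API: `LineageGraph`, `record`, and `impact` are hypothetical names, and a production system would persist the log and index rather than hold them in memory.

```python
from collections import defaultdict, deque

class LineageGraph:
    """Table-level lineage: an append-only event log plus a derived
    adjacency index. Illustrative sketch; names are hypothetical."""

    def __init__(self):
        self.events = []                    # append-only raw event log
        self.downstream = defaultdict(set)  # derived index: node -> children

    def record(self, inputs, op, output):
        """Append one transformation event and update the derived index."""
        self.events.append({"inputs": inputs, "op": op, "output": output})
        for src in inputs:
            self.downstream[src].add(output)

    def impact(self, node, max_hops=3):
        """Impact analysis: every node reachable from `node` within max_hops."""
        seen, frontier = set(), deque([(node, 0)])
        while frontier:
            cur, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for child in self.downstream[cur]:
                if child not in seen:
                    seen.add(child)
                    frontier.append((child, depth + 1))
        return seen

g = LineageGraph()
g.record(["raw.events", "raw.users"], "join", "staging.sessions")
g.record(["staging.sessions"], "aggregate", "features.session_stats")
g.record(["features.session_stats"], "train", "models.churn_v3")
print(g.impact("raw.events"))  # all three downstream artifacts
```

Keeping the raw log append-only means the adjacency index can always be rebuilt (or compacted) from scratch, which is why production systems treat the index as disposable derived state.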
💡 Key Takeaways
Table-level lineage provides coarse impact analysis at minimal cost, suitable for warehouse-scale tracking across thousands of datasets with sub-second query performance
Column-level lineage enables precise schema evolution safety and feature-level audits, tracking which output columns derive from which input columns with moderate metadata overhead
Row-level lineage inflates storage 5 to 20 times and doubles compute costs, justified only for regulated workloads or sensitive join operations requiring full audit trails
Lineage graphs at enterprise scale contain millions of nodes and tens of millions of edges, requiring denormalized adjacency lists and precomputed summaries to keep queries under 500 milliseconds
Centralized orchestrator-based lineage achieves consistent coverage for scheduled jobs but misses ad hoc notebooks; app-instrumented approaches capture heterogeneous workloads but require developer discipline
📌 Examples
Uber Michelangelo tracks features as versioned, timestamped entities with lineage connecting models to feature sets and back to raw sources, preventing training-serving skew through consistent offline and online stores
Airbnb Minerva defines canonical metrics as versioned transformations with full lineage, enabling impact analysis across dashboards, experiments, and ML models to guarantee consistent governance
A lineage query finding all downstream impacts within 3 hops of a schema change runs in under 500 milliseconds by precomputing neighborhood summaries and caching frequently accessed paths in a graph-optimized store