
Data Lineage: Tracking Transformations and Dependencies

Data lineage captures where data came from and every transformation applied to it, forming a directed graph whose nodes are datasets, columns, features, or models and whose edges represent operations such as joins, aggregations, or training runs. This graph ties data versions, code versions, environment specifications, and model artifacts into a complete audit trail.

Lineage operates at different granularities with corresponding cost trade-offs. Table-level lineage tracks dependencies between whole datasets, adds minimal overhead, and scales to millions of nodes. Column-level lineage maps which output columns depend on which input columns; it is essential for safe schema evolution and adds moderate metadata overhead. Row-level lineage traces individual output rows back to specific input rows, inflating storage 5 to 20 times and doubling CPU costs, so it is reserved for regulated workloads and transformations over sensitive personally identifiable information (PII).

Production lineage systems store raw events as an append-only log and build derived graph indexes with denormalized adjacency lists. At scale, lineage graphs reach millions of nodes and tens of millions of edges; keeping common queries under 500 milliseconds requires precomputing two- to three-hop neighborhoods, caching hot paths, and periodically compacting identical edges across daily partitions.

The payoff is impact analysis (what breaks if an upstream schema changes), reproducibility for any historical experiment, and root-cause analysis when model accuracy suddenly drops.
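The event-log-plus-derived-index design can be sketched in a few lines. This is a minimal illustration, not any particular system's API: `LineageGraph`, `record`, and `impact` are hypothetical names, and a production system would persist the log and index rather than hold them in memory.

```python
from collections import defaultdict, deque

class LineageGraph:
    """Table-level lineage: an append-only event log plus a derived
    adjacency index. Illustrative sketch; names are hypothetical."""

    def __init__(self):
        self.events = []                    # append-only raw event log
        self.downstream = defaultdict(set)  # derived index: node -> children

    def record(self, inputs, op, output):
        """Append one transformation event and update the derived index."""
        self.events.append({"inputs": inputs, "op": op, "output": output})
        for src in inputs:
            self.downstream[src].add(output)

    def impact(self, node, max_hops=3):
        """Impact analysis: every node reachable from `node` within max_hops."""
        seen, frontier = set(), deque([(node, 0)])
        while frontier:
            cur, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for child in self.downstream[cur]:
                if child not in seen:
                    seen.add(child)
                    frontier.append((child, depth + 1))
        return seen

g = LineageGraph()
g.record(["raw.events", "raw.users"], "join", "staging.sessions")
g.record(["staging.sessions"], "aggregate", "features.session_stats")
g.record(["features.session_stats"], "train", "models.churn_v3")
print(g.impact("raw.events"))  # all three downstream artifacts
```

Keeping the raw log append-only means the adjacency index can always be rebuilt (or compacted) from scratch, which is why production systems treat the index as disposable derived state.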
💡 Key Takeaways
Table-level lineage provides coarse impact analysis at minimal cost, suitable for warehouse-scale tracking across thousands of datasets with sub-second query performance
Column-level lineage enables precise schema evolution safety and feature-level audits, tracking which output columns derive from which input columns with moderate metadata overhead
Row-level lineage inflates storage 5 to 20 times and doubles compute costs, justified only for regulated workloads or sensitive join operations requiring full audit trails
Lineage graphs at enterprise scale contain millions of nodes and tens of millions of edges, requiring denormalized adjacency lists and precomputed summaries to keep queries under 500 milliseconds
Centralized orchestrator-based lineage achieves consistent coverage for scheduled jobs but misses ad hoc notebooks; app-instrumented approaches capture heterogeneous workloads but require developer discipline
📌 Examples
Uber Michelangelo tracks features as versioned, timestamped entities with lineage connecting models to feature sets and back to raw sources, preventing training-serving skew through consistent offline and online stores
Airbnb Minerva defines canonical metrics as versioned transformations with full lineage, enabling impact analysis across dashboards, experiments, and ML models to guarantee consistent governance
A lineage query finding all downstream impacts within 3 hops of a schema change runs in under 500 milliseconds by precomputing neighborhood summaries and caching frequently accessed paths in a graph-optimized store