Lineage Graphs and Promotion Gates
What Lineage Graphs Model
A lineage graph models runs, parameters, metrics, datasets, models, and environments as nodes, with edges capturing consumed and produced relationships plus timestamps. This enables queries such as "find all models trained on dataset snapshot S" or "rebuild run R with identical inputs." Each edge records not just the artifact reference but also the access pattern: whether the run read the entire dataset or a filtered subset, whether it used a specific feature version from a feature store, and what exact preprocessing transform was applied. This metadata powers impact analysis when you discover a data quality issue.
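The node-and-edge model above can be sketched as a minimal in-memory store. All class and field names here are hypothetical, chosen for illustration; a production system would persist this in a graph database and record far richer edge metadata.

```python
from collections import defaultdict

class LineageGraph:
    """Minimal lineage store: typed nodes, plus consumed/produced edges
    per run, each edge carrying access-pattern metadata."""

    def __init__(self):
        self.nodes = {}                    # node_id -> {"type": ..., attrs}
        self.consumed = defaultdict(list)  # run_id -> [(node_id, meta)]
        self.produced = defaultdict(list)  # run_id -> [(node_id, meta)]

    def add_node(self, node_id, node_type, **attrs):
        self.nodes[node_id] = {"type": node_type, **attrs}

    def record_consumed(self, run_id, node_id, **meta):
        self.consumed[run_id].append((node_id, meta))

    def record_produced(self, run_id, node_id, **meta):
        self.produced[run_id].append((node_id, meta))

    def models_trained_on(self, dataset_id):
        """Answer the query 'find all models trained on dataset snapshot S':
        every model produced by a run that consumed the given dataset."""
        models = []
        for run_id, inputs in self.consumed.items():
            if any(nid == dataset_id for nid, _ in inputs):
                models += [nid for nid, _ in self.produced[run_id]
                           if self.nodes[nid]["type"] == "model"]
        return models

g = LineageGraph()
g.add_node("snapshot_S", "dataset")
g.add_node("run_1", "run")
g.add_node("model_A", "model")
# Edge metadata captures the access pattern, not just the reference.
g.record_consumed("run_1", "snapshot_S", access="filtered", predicate="country=US")
g.record_produced("run_1", "model_A")
print(g.models_trained_on("snapshot_S"))  # -> ['model_A']
```

The same traversal run in reverse (from a bad dataset snapshot forward through consumers) is what powers impact analysis after a data quality incident.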
Promotion Gates in CI/CD
Promotion gates integrate with CI/CD pipelines so that only runs with complete provenance, and evaluation results passing pre-registered thresholds, can be registered or pushed to production. A typical gate checks: Does this run have a dataset fingerprint? Is the code commit tagged? Does the environment digest match an approved base image? Do evaluation metrics exceed the current production model's by a statistically significant margin? Did the evaluator run on time-sliced and feature-sliced test sets? The system records the decision context, including who approved, when, and why, for audits.
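A gate of this shape can be sketched as a pure function over the run's recorded metadata. The field names (`dataset_fingerprint`, `env_digest`, `metric_ci_lower`, and so on) are assumptions for illustration, not any particular registry's schema:

```python
def promotion_gate(run, production_metric, approved_digests, min_delta=0.0):
    """Evaluate the checklist for one candidate run and return an
    auditable decision record. `run` is a dict of recorded metadata."""
    checks = {
        "has_dataset_fingerprint": bool(run.get("dataset_fingerprint")),
        "commit_tagged": bool(run.get("code_tag")),
        "approved_environment": run.get("env_digest") in approved_digests,
        # The lower bound of the candidate metric's confidence interval must
        # beat production by min_delta: a significant, not noisy, improvement.
        "significant_improvement":
            run.get("metric_ci_lower", float("-inf")) - production_metric > min_delta,
        "sliced_eval_ran": run.get("sliced_eval_complete", False),
    }
    # Record the full decision context (who, when, why) for audits.
    return {"passed": all(checks.values()), "checks": checks,
            "approver": run.get("approver"), "reason": run.get("reason")}

run = {
    "dataset_fingerprint": "abc123",
    "code_tag": "v1.4.2",
    "env_digest": "sha256:deadbeef",
    "metric_ci_lower": 0.861,
    "sliced_eval_complete": True,
    "approver": "alice",
    "reason": "weekly retrain",
}
decision = promotion_gate(run, production_metric=0.85,
                          approved_digests={"sha256:deadbeef"})
print(decision["passed"])  # -> True
```

Returning the per-check breakdown, rather than a bare boolean, is what makes the gate auditable: the decision record shows exactly which condition blocked a failed promotion.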
Production Implementations
Google's TFX uses an Evaluator component that compares candidates against baseline models on per-slice metrics with confidence intervals. Models are only pushed if they show statistically significant improvements, which prevents noisy gains from triggering rollouts. Uber's Michelangelo ties each model to exact feature definitions from its Palette feature store and to backfilled data windows, blocking registration if lineage is incomplete.
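The per-slice gating idea can be sketched as follows. This is a simplified illustration in the spirit of TFX's Evaluator, not its actual API: each slice's improvement must clear its confidence-interval half-width before the candidate is eligible for a push.

```python
def passes_all_slices(candidate, baseline, ci_halfwidth):
    """Return True only if, on every slice, the candidate's improvement
    over the baseline exceeds that slice's CI half-width, i.e. the
    improvement's confidence interval excludes zero."""
    for slice_name, cand_metric in candidate.items():
        delta = cand_metric - baseline[slice_name]
        if delta - ci_halfwidth[slice_name] <= 0:
            return False  # not significant on this slice: block the push
    return True

# Illustrative per-slice accuracy numbers (hypothetical data).
candidate = {"overall": 0.85, "new_users": 0.80}
baseline  = {"overall": 0.84, "new_users": 0.79}
halfwidth = {"overall": 0.005, "new_users": 0.005}
print(passes_all_slices(candidate, baseline, halfwidth))  # -> True
```

Requiring significance on every slice, rather than only in aggregate, is what catches candidates that improve the overall metric while regressing on a minority slice.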
Statistical Rigor for Comparisons
Comparison at scale requires normalizing metric names and slicing schemas across runs: standardize on a core schema with extensible tags for custom fields. For statistical rigor, support summaries across repeated runs: mean, variance, and confidence intervals. For small improvements (under 0.5 percent relative), run N = 3 to 10 repeats and promote only if the confidence interval on the improvement excludes zero. This multiplies compute cost by N, but it prevents overfitting to test-set noise and reduces Type I errors.
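The repeated-runs criterion can be made concrete with a few lines of standard statistics. This sketch uses a normal-approximation 95% interval (z = 1.96) on the per-repeat metric deltas; for N as small as 3, a t-distribution critical value would be the more careful choice.

```python
import math
from statistics import mean, stdev

def ci_excludes_zero(deltas, z=1.96):
    """Given per-repeat metric deltas (candidate minus baseline), return
    True if the 95% CI's lower bound is above zero, i.e. the improvement
    is significant and the candidate may be promoted."""
    n = len(deltas)
    se = stdev(deltas) / math.sqrt(n)   # standard error of the mean delta
    lower = mean(deltas) - z * se
    return lower > 0

# Hypothetical relative improvements from N = 5 repeated runs.
deltas = [0.004, 0.006, 0.005, 0.003, 0.007]
print(ci_excludes_zero(deltas))  # -> True
```

Because each repeat reruns training end to end, the N-fold compute cost is paid per candidate; in practice teams reserve repeated-run gating for changes whose claimed improvement is within noise range of the baseline.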