
Lineage Graphs and Promotion Gates

A lineage graph models runs, parameters, metrics, datasets, models, and environments as nodes, with edges capturing consumed and produced relationships plus timestamps. This enables queries such as "find all models trained on dataset snapshot S" or "rebuild run R with identical inputs." Each edge records not just the artifact reference but also the access pattern: did the run read the entire dataset or a filtered subset? Did it use a specific feature version from a feature store? What exact preprocessing transform was applied? This metadata powers impact analysis when you discover a data quality issue.

Promotion gates integrate with CI/CD pipelines so that only runs with complete provenance and evaluations passing pre-registered thresholds can be registered or pushed to production. A typical gate checks: Does this run have a dataset fingerprint? Is the code commit tagged? Does the environment digest match an approved base image? Do evaluation metrics exceed the current production model by a statistically significant margin? Did the evaluator run on time-sliced and feature-sliced test sets? The system records the decision context, including who approved, when, and why, for audits.

Google TFX uses an Evaluator component that compares candidates against baseline models on per-slice metrics with confidence intervals; models are only pushed if they show statistically significant improvements, preventing noisy gains from triggering rollouts. Uber Michelangelo ties each model to exact feature definitions from Zipline and backfilled data windows, blocking registration if lineage is incomplete. Meta FBLearner Flow enables side-by-side comparison of hundreds of runs with normalized metric schemas, helping teams identify the best candidate from large hyperparameter sweeps.

Comparison at scale requires normalizing metric names and slicing schemas across runs. If one experiment logs "accuracy" and another logs "acc", dashboards cannot compare them. Standardize on a core schema with extensible tags for custom fields. For statistical rigor, support summaries across repeated runs: mean, variance, and confidence intervals. For small improvements (under 0.5 percent relative), run N = 3 to 10 repeats and only promote if the confidence interval excludes zero. This increases compute cost by a factor of N, but it prevents overfitting to test-set noise and reduces Type I errors from iterating many candidates against the same validation slice.
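A minimal sketch of what such a gate might look like as a CI/CD step. The field names (dataset_fingerprint, code_commit_tag, env_digest, sliced_eval_complete) and the approved-image allowlist are hypothetical, not any particular tool's schema; a real system would pull these from the tracking server and model registry:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical allowlist of approved base-image digests.
APPROVED_ENV_DIGESTS = {"sha256:3f9a", "sha256:81cc"}

@dataclass
class GateDecision:
    passed: bool
    reasons: list[str]
    approved_by: str | None
    decided_at: str

def promotion_gate(run: dict, improvement_ci: tuple[float, float],
                   approver: str) -> GateDecision:
    """Check provenance completeness plus a pre-computed improvement CI.

    `run` is a hypothetical record from the tracking store; `improvement_ci`
    is the (lower, upper) confidence interval of candidate-minus-baseline
    metric deltas across repeated evaluations.
    """
    reasons = []
    if not run.get("dataset_fingerprint"):
        reasons.append("missing dataset fingerprint")
    if not run.get("code_commit_tag"):
        reasons.append("code commit is not tagged")
    if run.get("env_digest") not in APPROVED_ENV_DIGESTS:
        reasons.append("environment digest not in approved base images")
    if not run.get("sliced_eval_complete"):
        reasons.append("time-sliced / feature-sliced evaluation missing")
    # Statistical check: promote only if the CI lies entirely above zero.
    if improvement_ci[0] <= 0:
        reasons.append(f"improvement CI {improvement_ci} does not exclude zero")

    # Record the decision context (who, when, why) for audits.
    return GateDecision(
        passed=not reasons,
        reasons=reasons,
        approved_by=approver if not reasons else None,
        decided_at=datetime.now(timezone.utc).isoformat(),
    )
```

In a pipeline, this check would sit between the evaluator and the model registry, and the returned decision record would be persisted alongside the run's lineage so audits can recover who approved what, when, and why.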
💡 Key Takeaways
Lineage graph models runs, datasets, models, environments as nodes with consumed and produced edges recording timestamps and access patterns like filtered subsets or specific feature versions
Promotion gates block deployment unless run has dataset fingerprint, tagged code commit, approved environment digest, and evaluation metrics exceeding baseline by statistically significant margin
Google TFX Evaluator pushes models only with statistically significant per-slice improvements; Uber Michelangelo blocks registration without complete Zipline feature lineage and backfill windows
Comparison at scale normalizes metric names across runs; for improvements under 0.5 percent relative, run N = 3 to 10 repeats and report mean and confidence intervals to prevent test-set overfitting
Lineage enables impact analysis queries: Find all models trained on dataset snapshot S or rebuild run R with identical inputs by traversing consumed edges in provenance graph
Statistical rigor: Running 10 repeats increases compute cost 10x but reduces Type I error from iterating many candidates against the same validation slice; only promote if the confidence interval excludes zero (see the sketch below)
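A small sketch of the repeat-and-test logic, using a t-based 95% confidence interval on paired per-repeat metric deltas; the F1 values and repeat count below are made up for illustration, and scipy is used for the t critical value:

```python
import statistics
from scipy.stats import t

def improvement_ci(candidate: list[float], baseline: list[float],
                   confidence: float = 0.95) -> tuple[float, float]:
    """CI of the mean paired difference (candidate - baseline) across N repeats."""
    assert len(candidate) == len(baseline) and len(candidate) >= 2
    diffs = [c - b for c, b in zip(candidate, baseline)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / n ** 0.5       # standard error of the mean
    crit = t.ppf((1 + confidence) / 2, df=n - 1)   # two-sided t critical value
    return mean - crit * sem, mean + crit * sem

# Hypothetical F1 scores from N = 5 repeated train/eval runs per model.
candidate_f1 = [0.861, 0.858, 0.863, 0.859, 0.862]
baseline_f1  = [0.857, 0.856, 0.858, 0.855, 0.857]

lo, hi = improvement_ci(candidate_f1, baseline_f1)
print(f"95% CI for improvement: ({lo:.4f}, {hi:.4f})")
print("promote" if lo > 0 else "do not promote")   # promote only if CI excludes zero
```

Pairing the differences per repeat keeps the comparison on the same evaluation data, and the resulting interval is one reasonable way to operationalize "CI excludes zero"; the gate sketched earlier would consume exactly this interval.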
📌 Examples
Google TFX: Evaluator component compares candidate versus baseline on time sliced and feature sliced test sets with confidence intervals, blocks push if improvement not statistically significant at p < 0.05
Uber Michelangelo: Model registration checks Zipline feature definition versions and backfill window timestamps; blocks if lineage incomplete preventing training serving skew from feature drift
Meta FBLearner Flow: Side by side comparison dashboards for hundreds of hyperparameter search runs with normalized metric schema enabling identification of best candidate from large sweeps
Lineage query example: SELECT model_id, dataset_snapshot, code_commit FROM lineage WHERE dataset_snapshot = 'user_events_20240115' AND metric_f1 > 0.85 ORDER BY metric_f1 DESC LIMIT 10
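As a complement to the tabular SQL view, a minimal sketch of the same lineage as a graph using networkx; the node names and "consumed"/"produced" edge relations are illustrative, not a specific tool's schema:

```python
import networkx as nx

# Directed lineage graph: artifact -> run means "consumed", run -> artifact means "produced".
G = nx.DiGraph()
G.add_edge("dataset:user_events_20240115", "run:R1", relation="consumed")
G.add_edge("env:sha256:3f9a", "run:R1", relation="consumed")
G.add_edge("code:commit_ab12", "run:R1", relation="consumed")
G.add_edge("run:R1", "model:M7", relation="produced")
G.add_edge("dataset:user_events_20240115", "run:R2", relation="consumed")
G.add_edge("run:R2", "model:M8", relation="produced")

def models_trained_on(snapshot: str) -> list[str]:
    """Find all models produced by runs that consumed the given dataset snapshot."""
    models = []
    for run in G.successors(snapshot):        # runs that consumed the snapshot
        for artifact in G.successors(run):    # artifacts those runs produced
            if artifact.startswith("model:"):
                models.append(artifact)
    return models

def inputs_to_rebuild(run: str) -> list[str]:
    """Everything the run consumed: the inputs needed to rebuild it identically."""
    return list(G.predecessors(run))

print(models_trained_on("dataset:user_events_20240115"))  # ['model:M7', 'model:M8']
print(inputs_to_rebuild("run:R1"))  # dataset snapshot, environment digest, code commit
```

In practice each edge would also carry timestamps and access-pattern metadata (filtered subset, feature version, transform hash), so the same traversal can answer impact-analysis questions after a data quality incident.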