
Lineage Graphs and Promotion Gates

A lineage graph models runs, parameters, metrics, datasets, models, and environments as nodes, with edges capturing consumed and produced relationships plus timestamps. This enables queries such as "find all models trained on dataset snapshot S" or "rebuild run R with identical inputs." Each edge records not just the artifact reference but also the access pattern: did the run read the entire dataset or a filtered subset? Did it use a specific feature version from a feature store? What exact preprocessing transform was applied? This metadata powers impact analysis when you discover a data quality issue.

Promotion gates integrate with CI/CD pipelines so that only runs with complete provenance and evaluations passing pre-registered thresholds can be registered or pushed to production. A typical gate checks: Does this run have a dataset fingerprint? Is the code commit tagged? Does the environment digest match an approved base image? Do evaluation metrics exceed the current production model by a statistically significant margin? Did the evaluator run on time-sliced and feature-sliced test sets? The system records the decision context, including who approved, when, and why, for audits.

Google TFX uses an Evaluator component that compares candidates against baseline models on per-slice metrics with confidence intervals; models are only pushed if they show statistically significant improvements, preventing noisy gains from triggering rollouts. Uber Michelangelo ties each model to exact feature definitions from Zipline and backfilled data windows, blocking registration if lineage is incomplete. Meta FBLearner Flow enables side-by-side comparison of hundreds of runs with normalized metric schemas, helping teams identify the best candidate from large hyperparameter sweeps.

Comparison at scale requires normalizing metric names and slicing schemas across runs. If one experiment logs "accuracy" and another logs "acc", dashboards cannot compare them. Standardize on a core schema with extensible tags for custom fields. For statistical rigor, support summaries across repeated runs: mean, variance, and confidence intervals. For small improvements (under 0.5 percent relative), run N = 3 to 10 repeats and only promote if the confidence interval excludes zero. This increases compute cost by a factor of N, but it prevents overfitting to test-set noise and reduces Type I errors from iterating many candidates against the same validation slice.
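A minimal sketch of what such a gate might look like as a CI/CD step. The field names (dataset_fingerprint, code_commit_tag, env_digest, sliced_eval_complete) and the approved-image allowlist are hypothetical, not any particular tool's schema; a real system would pull these from the tracking server and model registry:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical allowlist of approved base-image digests.
APPROVED_ENV_DIGESTS = {"sha256:3f9a", "sha256:81cc"}

@dataclass
class GateDecision:
    passed: bool
    reasons: list[str]
    approved_by: str | None
    decided_at: str

def promotion_gate(run: dict, improvement_ci: tuple[float, float],
                   approver: str) -> GateDecision:
    """Check provenance completeness plus a pre-computed improvement CI.

    `run` is a hypothetical record from the tracking store; `improvement_ci`
    is the (lower, upper) confidence interval of candidate-minus-baseline
    metric deltas across repeated evaluations.
    """
    reasons = []
    if not run.get("dataset_fingerprint"):
        reasons.append("missing dataset fingerprint")
    if not run.get("code_commit_tag"):
        reasons.append("code commit is not tagged")
    if run.get("env_digest") not in APPROVED_ENV_DIGESTS:
        reasons.append("environment digest not in approved base images")
    if not run.get("sliced_eval_complete"):
        reasons.append("time-sliced / feature-sliced evaluation missing")
    # Statistical check: promote only if the CI lies entirely above zero.
    if improvement_ci[0] <= 0:
        reasons.append(f"improvement CI {improvement_ci} does not exclude zero")

    # Record the decision context (who, when, why) for audits.
    return GateDecision(
        passed=not reasons,
        reasons=reasons,
        approved_by=approver if not reasons else None,
        decided_at=datetime.now(timezone.utc).isoformat(),
    )
```

In a pipeline, this check would sit between the evaluator and the model registry, and the returned decision record would be persisted alongside the run's lineage so audits can recover who approved what, when, and why.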
💡 Key Takeaways
Lineage graph models runs, datasets, models, environments as nodes with consumed and produced edges recording timestamps and access patterns like filtered subsets or specific feature versions
Promotion gates block deployment unless run has dataset fingerprint, tagged code commit, approved environment digest, and evaluation metrics exceeding baseline by statistically significant margin
Google TFX Evaluator pushes models only with statistically significant per-slice improvements; Uber Michelangelo blocks registration without complete Zipline feature lineage and backfill windows
Comparison at scale normalizes metric names across runs; for improvements under 0.5 percent relative, run N = 3 to 10 repeats and report mean and confidence intervals to prevent test-set overfitting
Lineage enables impact analysis queries: Find all models trained on dataset snapshot S or rebuild run R with identical inputs by traversing consumed edges in provenance graph
Statistical rigor: Running 10 repeats increases compute cost 10x but reduces Type I error from iterating many candidates against the same validation slice; only promote if the confidence interval excludes zero (see the sketch below)
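A small sketch of the repeat-and-test logic, using a t-based 95% confidence interval on paired per-repeat metric deltas; the F1 values and repeat count below are made up for illustration, and scipy is used for the t critical value:

```python
import statistics
from scipy.stats import t

def improvement_ci(candidate: list[float], baseline: list[float],
                   confidence: float = 0.95) -> tuple[float, float]:
    """CI of the mean paired difference (candidate - baseline) across N repeats."""
    assert len(candidate) == len(baseline) and len(candidate) >= 2
    diffs = [c - b for c, b in zip(candidate, baseline)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / n ** 0.5       # standard error of the mean
    crit = t.ppf((1 + confidence) / 2, df=n - 1)   # two-sided t critical value
    return mean - crit * sem, mean + crit * sem

# Hypothetical F1 scores from N = 5 repeated train/eval runs per model.
candidate_f1 = [0.861, 0.858, 0.863, 0.859, 0.862]
baseline_f1  = [0.857, 0.856, 0.858, 0.855, 0.857]

lo, hi = improvement_ci(candidate_f1, baseline_f1)
print(f"95% CI for improvement: ({lo:.4f}, {hi:.4f})")
print("promote" if lo > 0 else "do not promote")   # promote only if CI excludes zero
```

Pairing the differences per repeat keeps the comparison on the same evaluation data, and the resulting interval is one reasonable way to operationalize "CI excludes zero"; the gate sketched earlier would consume exactly this interval.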
📌 Examples
Google TFX: Evaluator component compares candidate versus baseline on time sliced and feature sliced test sets with confidence intervals, blocks push if improvement not statistically significant at p < 0.05
Uber Michelangelo: Model registration checks Zipline feature definition versions and backfill window timestamps; blocks if lineage incomplete preventing training serving skew from feature drift
Meta FBLearner Flow: Side by side comparison dashboards for hundreds of hyperparameter search runs with normalized metric schema enabling identification of best candidate from large sweeps
Lineage query example: SELECT model_id, dataset_snapshot, code_commit FROM lineage WHERE dataset_snapshot = 'user_events_20240115' AND metric_f1 > 0.85 ORDER BY metric_f1 DESC LIMIT 10
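As a complement to the tabular SQL view, a minimal sketch of the same lineage as a graph using networkx; the node names and "consumed"/"produced" edge relations are illustrative, not a specific tool's schema:

```python
import networkx as nx

# Directed lineage graph: artifact -> run means "consumed", run -> artifact means "produced".
G = nx.DiGraph()
G.add_edge("dataset:user_events_20240115", "run:R1", relation="consumed")
G.add_edge("env:sha256:3f9a", "run:R1", relation="consumed")
G.add_edge("code:commit_ab12", "run:R1", relation="consumed")
G.add_edge("run:R1", "model:M7", relation="produced")
G.add_edge("dataset:user_events_20240115", "run:R2", relation="consumed")
G.add_edge("run:R2", "model:M8", relation="produced")

def models_trained_on(snapshot: str) -> list[str]:
    """Find all models produced by runs that consumed the given dataset snapshot."""
    models = []
    for run in G.successors(snapshot):        # runs that consumed the snapshot
        for artifact in G.successors(run):    # artifacts those runs produced
            if artifact.startswith("model:"):
                models.append(artifact)
    return models

def inputs_to_rebuild(run: str) -> list[str]:
    """Everything the run consumed: the inputs needed to rebuild it identically."""
    return list(G.predecessors(run))

print(models_trained_on("dataset:user_events_20240115"))  # ['model:M7', 'model:M8']
print(inputs_to_rebuild("run:R1"))  # dataset snapshot, environment digest, code commit
```

In practice each edge would also carry timestamps and access-pattern metadata (filtered subset, feature version, transform hash), so the same traversal can answer impact-analysis questions after a data quality incident.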