
What is Experiment Tracking and Reproducibility in ML Systems?

Experiment tracking is the discipline of capturing the complete provenance of a machine learning run: the code snapshot, data snapshot, configuration, environment, hardware, runtime logs, metrics, and outputs. Think of it as a detailed lab notebook that records everything needed to understand and recreate an experiment. Reproducibility goes beyond just fixing a random seed: it means you can re-execute an experiment months later, on different hardware, after library updates, and recover the same outcome, or at least understand why the results differ.

In production systems, this takes the form of a lineage graph where each run is a node connected to the artifacts it consumed and produced. A run might consume a dataset snapshot and feature definitions, then produce model weights and evaluation reports. Every run carries critical metadata: a unique run ID, a parent run ID for tracking evolution, the code version as a commit hash, an environment fingerprint such as a container image digest, random seeds, a hardware profile (CPU or GPU type), and the full configuration of hyperparameters and data filters.

Production systems recognize three levels of reproducibility. Bitwise exact means identical binary output, useful for audits and regulated industries. Algorithmic deterministic means the same numbers within floating-point precision, suitable for most production use cases. Statistical means the same distribution within confidence bounds, acceptable for research and hyperparameter optimization where you run multiple trials anyway.

This runs at serious scale: Meta processes millions of experiments per year with FBLearner Flow, and Netflix uses Metaflow to generate thousands of workflow tasks daily with complete audit trails. The core architecture separates concerns: a write-optimized metadata store handles run events, a durable artifact store manages large binaries like models and datasets, and a lineage service builds and queries the provenance graph. Logging happens asynchronously to avoid slowing down training, and gating policies in continuous integration and continuous deployment (CI/CD) block deployments that lack reproducibility guarantees or fail evaluation thresholds.
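To make the run metadata concrete, here is a minimal sketch of a run record, assuming a hypothetical in-house tracker; the field names (run_id, parent_run_id, code_commit, image_digest, and so on) are illustrative rather than any particular tool's schema.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class RunRecord:
    """Provenance metadata captured for a single training run (illustrative schema)."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    parent_run_id: Optional[str] = None           # run this one evolved from
    code_commit: str = ""                         # git commit hash of the training code
    image_digest: str = ""                        # container image digest (environment fingerprint)
    seeds: dict = field(default_factory=dict)     # e.g. {"python": 42, "numpy": 42}
    hardware: dict = field(default_factory=dict)  # e.g. {"gpu": "A100", "count": 8}
    config: dict = field(default_factory=dict)    # hyperparameters, data filters, etc.
    inputs: list = field(default_factory=list)    # dataset / feature-set artifacts consumed
    outputs: list = field(default_factory=list)   # model / report artifacts produced

    def config_hash(self) -> str:
        """Stable hash of the configuration, handy for deduplicating identical runs."""
        blob = json.dumps(self.config, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

# Example: a run that consumed one dataset snapshot and produced model weights.
record = RunRecord(
    code_commit="3f2a9c1",
    image_digest="sha256:ab12...",
    seeds={"python": 42, "numpy": 42},
    hardware={"gpu": "A100", "count": 8},
    config={"lr": 3e-4, "batch_size": 256, "data_filter": "country=US"},
    inputs=["dataset:clicks@2024-05-01"],
    outputs=["model:ranker-v17"],
)
print(json.dumps(asdict(record), indent=2))
```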
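Reaching the algorithmic-deterministic level usually means pinning every seed and forcing deterministic kernels. A sketch, assuming a PyTorch training setup (other frameworks expose similar switches); the seed values themselves would be logged into the run record above.

```python
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 42) -> None:
    """Pin all sources of randomness so reruns produce the same numbers
    (within floating-point precision, on the same hardware and library versions)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Required by cuBLAS for deterministic matmuls on recent CUDA versions.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    # Force deterministic GPU kernels; raises if an op has no deterministic variant.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

make_deterministic(42)
```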
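Asynchronous logging is typically just an in-process queue drained by a background thread, so the training loop pays only the cost of an enqueue. A minimal sketch; the AsyncMetricLogger class and its flush behavior are illustrative, not any specific tracker's API.

```python
import queue
import threading
import time

class AsyncMetricLogger:
    """Buffers metric events in memory and ships them to the metadata store
    from a background thread, keeping overhead in the training loop tiny."""

    def __init__(self, flush_interval_s: float = 2.0):
        self._queue: queue.Queue = queue.Queue()
        self._flush_interval_s = flush_interval_s
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, run_id: str, step: int, **metrics) -> None:
        # Called from the training loop: just an enqueue, no network I/O here.
        self._queue.put({"run_id": run_id, "step": step, "metrics": metrics, "ts": time.time()})

    def _drain(self) -> None:
        while True:
            try:
                first = self._queue.get(timeout=self._flush_interval_s)
            except queue.Empty:
                continue
            batch = [first]
            # Opportunistically grab whatever else is already queued.
            while True:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            self._ship(batch)

    def _ship(self, batch: list) -> None:
        # Placeholder: in a real system this would write to the metadata store.
        print(f"flushed {len(batch)} metric events")

logger = AsyncMetricLogger()
for step in range(100):
    logger.log(run_id="run-123", step=step, loss=1.0 / (step + 1))
time.sleep(3)  # give the background thread time to flush before exiting
```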
💡 Key Takeaways
Experiment tracking captures full provenance: code, data, config, environment, hardware, logs, metrics, and outputs for every machine learning run
Three reproducibility levels: Bitwise exact for audits, algorithmic deterministic for production (same numbers within float precision), statistical for research (same distribution within confidence bounds)
Production architecture uses three components: a write-optimized metadata store for run events, a durable artifact store for large binaries, and a lineage service for provenance queries
Meta handles millions of experiments yearly with FBLearner Flow; Netflix generates thousands of Metaflow tasks daily with sub-second per-step overhead
A typical mid-size org with 5 teams running 500 runs per day generates 70 to 85 GB daily, requiring 2 to 2.5 TB for 30-day hot retention (70 to 85 GB × 30 days ≈ 2.1 to 2.55 TB)
Asynchronous logging keeps training overhead under 1 percent while maintaining complete audit trails and enabling CI/CD gates that block non-reproducible deployments (a sketch of such a gate follows this list)
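Such a gate can be a small check in the deployment pipeline: refuse to promote a model whose run record is missing provenance fields or whose evaluation falls below a threshold. A sketch that assumes the illustrative run-record fields from the earlier example; the required fields, metric name, and threshold are placeholders.

```python
REQUIRED_FIELDS = ("code_commit", "image_digest", "seeds", "inputs")

def can_deploy(run_record: dict, eval_metrics: dict, min_auc: float = 0.75) -> tuple[bool, list]:
    """Return (allowed, reasons) for a CI/CD gate that blocks non-reproducible
    or under-performing candidate models."""
    reasons = []
    for name in REQUIRED_FIELDS:
        if not run_record.get(name):
            reasons.append(f"missing provenance field: {name}")
    if eval_metrics.get("auc", 0.0) < min_auc:
        reasons.append(f"auc {eval_metrics.get('auc')} below threshold {min_auc}")
    return (not reasons, reasons)

# Example: a run that never recorded its container image digest is blocked.
ok, why = can_deploy(
    {"code_commit": "3f2a9c1", "image_digest": "", "seeds": {"numpy": 42}, "inputs": ["dataset:clicks"]},
    {"auc": 0.81},
)
print(ok, why)  # False ['missing provenance field: image_digest']
```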
📌 Examples
Meta FBLearner Flow: Handles tens of thousands of run events per day supporting ranking, vision, and NLP use cases with centralized metadata capturing code version, input datasets, hyperparameters, and evaluation results
Uber Michelangelo with Zipline: Supports thousands of production models across marketplace and ETA prediction, experiments on 100 million to 1 billion labeled rows, stores model binaries from 10 to 500 MB with exact feature definition lineage
Google TFX with ML Metadata: Runs hundreds of pipelines with tens of thousands of pipeline steps daily, only pushes models when Evaluator finds statistically significant improvements with per-slice metrics and confidence intervals