
What is Experiment Tracking and Reproducibility in ML Systems?

Experiment tracking is the discipline of capturing the complete provenance of a machine learning run: the code snapshot, data snapshot, configuration, environment, hardware, runtime logs, metrics, and outputs. Think of it as a detailed lab notebook that records everything needed to understand and recreate an experiment. Reproducibility goes beyond just fixing a random seed: it means you can re-execute an experiment months later, on different hardware, after library updates, and recover the same outcome, or at least understand why the results differ.

In production systems, this takes the form of a lineage graph where each run is a node connected to the artifacts it consumed and produced. A run might consume a dataset snapshot and feature definitions, then produce model weights and evaluation reports. Every run carries critical metadata: a unique run ID, a parent run ID for tracking evolution, the code version as a commit hash, an environment fingerprint such as a container image digest, random seeds, a hardware profile (CPU or GPU type), and the full configuration of hyperparameters and data filters.

Production systems recognize three levels of reproducibility. Bitwise exact means identical binary output, useful for audits and regulated industries. Algorithmic deterministic means the same numbers within floating-point precision, suitable for most production use cases. Statistical means the same distribution within confidence bounds, acceptable for research and hyperparameter optimization where you run multiple trials anyway.

This runs at serious scale: Meta processes millions of experiments per year with FBLearner Flow, and Netflix uses Metaflow to generate thousands of workflow tasks daily with complete audit trails. The core architecture separates concerns: a write-optimized metadata store handles run events, a durable artifact store manages large binaries like models and datasets, and a lineage service builds and queries the provenance graph. Logging happens asynchronously to avoid slowing down training, and gating policies in continuous integration and continuous deployment (CI/CD) block deployments that lack reproducibility guarantees or fail evaluation thresholds.
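To make the run metadata concrete, here is a minimal sketch of a run record, assuming a hypothetical in-house tracker; the field names (run_id, parent_run_id, code_commit, image_digest, and so on) are illustrative rather than any particular tool's schema.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class RunRecord:
    """Provenance metadata captured for a single training run (illustrative schema)."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    parent_run_id: Optional[str] = None           # run this one evolved from
    code_commit: str = ""                         # git commit hash of the training code
    image_digest: str = ""                        # container image digest (environment fingerprint)
    seeds: dict = field(default_factory=dict)     # e.g. {"python": 42, "numpy": 42}
    hardware: dict = field(default_factory=dict)  # e.g. {"gpu": "A100", "count": 8}
    config: dict = field(default_factory=dict)    # hyperparameters, data filters, etc.
    inputs: list = field(default_factory=list)    # dataset / feature-set artifacts consumed
    outputs: list = field(default_factory=list)   # model / report artifacts produced

    def config_hash(self) -> str:
        """Stable hash of the configuration, handy for deduplicating identical runs."""
        blob = json.dumps(self.config, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

# Example: a run that consumed one dataset snapshot and produced model weights.
record = RunRecord(
    code_commit="3f2a9c1",
    image_digest="sha256:ab12...",
    seeds={"python": 42, "numpy": 42},
    hardware={"gpu": "A100", "count": 8},
    config={"lr": 3e-4, "batch_size": 256, "data_filter": "country=US"},
    inputs=["dataset:clicks@2024-05-01"],
    outputs=["model:ranker-v17"],
)
print(json.dumps(asdict(record), indent=2))
```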
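Reaching the algorithmic-deterministic level usually means pinning every seed and forcing deterministic kernels. A sketch, assuming a PyTorch training setup (other frameworks expose similar switches); the seed values themselves would be logged into the run record above.

```python
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 42) -> None:
    """Pin all sources of randomness so reruns produce the same numbers
    (within floating-point precision, on the same hardware and library versions)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Required by cuBLAS for deterministic matmuls on recent CUDA versions.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    # Force deterministic GPU kernels; raises if an op has no deterministic variant.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

make_deterministic(42)
```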
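Asynchronous logging is typically just an in-process queue drained by a background thread, so the training loop pays only the cost of an enqueue. A minimal sketch; the AsyncMetricLogger class and its flush behavior are illustrative, not any specific tracker's API.

```python
import queue
import threading
import time

class AsyncMetricLogger:
    """Buffers metric events in memory and ships them to the metadata store
    from a background thread, keeping overhead in the training loop tiny."""

    def __init__(self, flush_interval_s: float = 2.0):
        self._queue: queue.Queue = queue.Queue()
        self._flush_interval_s = flush_interval_s
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, run_id: str, step: int, **metrics) -> None:
        # Called from the training loop: just an enqueue, no network I/O here.
        self._queue.put({"run_id": run_id, "step": step, "metrics": metrics, "ts": time.time()})

    def _drain(self) -> None:
        while True:
            try:
                first = self._queue.get(timeout=self._flush_interval_s)
            except queue.Empty:
                continue
            batch = [first]
            # Opportunistically grab whatever else is already queued.
            while True:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            self._ship(batch)

    def _ship(self, batch: list) -> None:
        # Placeholder: in a real system this would write to the metadata store.
        print(f"flushed {len(batch)} metric events")

logger = AsyncMetricLogger()
for step in range(100):
    logger.log(run_id="run-123", step=step, loss=1.0 / (step + 1))
time.sleep(3)  # give the background thread time to flush before exiting
```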
💡 Key Takeaways
Experiment tracking captures full provenance: code, data, config, environment, hardware, logs, metrics, and outputs for every machine learning run
Three reproducibility levels: Bitwise exact for audits, algorithmic deterministic for production (same numbers within float precision), statistical for research (same distribution within confidence bounds)
Production architecture uses three components: a write-optimized metadata store for run events, a durable artifact store for large binaries, and a lineage service for provenance queries
Meta handles millions of experiments yearly with FBLearner Flow; Netflix generates thousands of Metaflow tasks daily with sub-second per-step overhead
A typical mid-size org with 5 teams running 500 runs per day generates 70 to 85 GB daily, requiring 2 to 2.5 TB for 30-day hot retention (70 to 85 GB × 30 days ≈ 2.1 to 2.55 TB)
Asynchronous logging keeps training overhead under 1 percent while maintaining complete audit trails and enabling CI/CD gates that block non-reproducible deployments (a sketch of such a gate follows this list)
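Such a gate can be a small check in the deployment pipeline: refuse to promote a model whose run record is missing provenance fields or whose evaluation falls below a threshold. A sketch that assumes the illustrative run-record fields from the earlier example; the required fields, metric name, and threshold are placeholders.

```python
REQUIRED_FIELDS = ("code_commit", "image_digest", "seeds", "inputs")

def can_deploy(run_record: dict, eval_metrics: dict, min_auc: float = 0.75) -> tuple[bool, list]:
    """Return (allowed, reasons) for a CI/CD gate that blocks non-reproducible
    or under-performing candidate models."""
    reasons = []
    for name in REQUIRED_FIELDS:
        if not run_record.get(name):
            reasons.append(f"missing provenance field: {name}")
    if eval_metrics.get("auc", 0.0) < min_auc:
        reasons.append(f"auc {eval_metrics.get('auc')} below threshold {min_auc}")
    return (not reasons, reasons)

# Example: a run that never recorded its container image digest is blocked.
ok, why = can_deploy(
    {"code_commit": "3f2a9c1", "image_digest": "", "seeds": {"numpy": 42}, "inputs": ["dataset:clicks"]},
    {"auc": 0.81},
)
print(ok, why)  # False ['missing provenance field: image_digest']
```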
📌 Examples
Meta FBLearner Flow: Handles tens of thousands of run events per day supporting ranking, vision, and NLP use cases with centralized metadata capturing code version, input datasets, hyperparameters, and evaluation results
Uber Michelangelo with Zipline: Supports thousands of production models across marketplace and ETA prediction, experiments on 100 million to 1 billion labeled rows, stores model binaries from 10 to 500 MB with exact feature definition lineage
Google TFX with ML Metadata: Runs hundreds of pipelines with tens of thousands of pipeline steps daily, only pushes models when Evaluator finds statistically significant improvements with per-slice metrics and confidence intervals