
Production Manifests: Linking Data, Code, and Environment

A production-ready versioning system treats all machine learning artifacts as immutable, content-addressed assets connected by manifests. Every pipeline run persists a manifest recording input data versions as cryptographic hashes or stream offsets, the code commit hash, the dependency lock file with exact package versions, an environment fingerprint capturing operating system and GPU driver versions, execution parameters, and output artifact hashes. This single document enables complete reproducibility months later.

The manifest structure creates a chain of custody from raw data through transformations to trained models and predictions. When a model serves a prediction, logging metadata including the timestamp, model version hash, feature vector version, and request identifier allows forensic replay. If accuracy degrades, you walk the lineage graph backward through the model manifest to the training data versions, then to the feature pipeline code and raw source offsets, pinpointing exactly what changed.

Implementation follows a two-phase commit protocol: first, write data and artifacts to storage; second, atomically register them in the catalog with manifest metadata. Lineage events are emitted only after registration succeeds, ensuring consumers only see committed, fully materialized versions with snapshot isolation. For multi-table writes, a logical transaction identifier ties all outputs together. Reference counting in metadata prevents premature deletion of base snapshots that still have dependent deltas in use, which would break reconstruction chains.
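As a concrete illustration, here is a minimal Python sketch of manifest construction. The helper names (sha256_of, build_manifest, manifest_id) and the field layout are illustrative, not any particular tool's API:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Content-address an artifact by hashing its bytes."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return f"sha256:{h.hexdigest()}"

def build_manifest(inputs, code_commit, lock_file, env, params, outputs):
    """Assemble a run manifest: every field is a hash or a pinned value."""
    return {
        "inputs": {name: sha256_of(p) for name, p in inputs.items()},
        "code_commit": code_commit,               # e.g. output of `git rev-parse HEAD`
        "dependency_lock": sha256_of(lock_file),  # hash of the lock file itself
        "environment": env,                       # OS, GPU driver / CUDA versions
        "parameters": params,
        "outputs": {name: sha256_of(p) for name, p in outputs.items()},
    }

def manifest_id(manifest: dict) -> str:
    """The manifest is itself content-addressed via its canonical JSON."""
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return f"sha256:{hashlib.sha256(canonical).hexdigest()}"
```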
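The write-then-register protocol can be sketched with SQLite standing in for the catalog. The schema and the commit_run and release_snapshot helpers below are assumptions for illustration; the transactional pattern is the point: manifest registration and reference-count updates succeed or fail together, and lineage events fire only afterward:

```python
import json
import sqlite3

def open_catalog(db_path: str = "catalog.db") -> sqlite3.Connection:
    """Open the stand-in catalog, creating its tables if needed."""
    con = sqlite3.connect(db_path)
    con.executescript("""
        CREATE TABLE IF NOT EXISTS manifests (id TEXT PRIMARY KEY, body TEXT);
        CREATE TABLE IF NOT EXISTS refcounts (snapshot TEXT PRIMARY KEY, refs INTEGER);
    """)
    return con

def commit_run(con, manifest_id, manifest, base_snapshots, emit_lineage_event):
    """Phase two of the protocol. Phase one already wrote the artifacts to
    object storage under their content hashes (idempotent by construction)."""
    with con:  # sqlite3 transaction: commit on success, rollback on error
        con.execute("INSERT INTO manifests VALUES (?, ?)",
                    (manifest_id, json.dumps(manifest)))
        for snap in base_snapshots:
            con.execute(
                "INSERT INTO refcounts VALUES (?, 1) "
                "ON CONFLICT(snapshot) DO UPDATE SET refs = refs + 1",
                (snap,))
    # Lineage events are emitted only after registration succeeds.
    emit_lineage_event({"manifest_id": manifest_id, "inputs": manifest["inputs"]})

def release_snapshot(con, snapshot):
    """Retention enforcement: a base snapshot is reclaimed only once no
    dependent delta still references it."""
    with con:
        con.execute("UPDATE refcounts SET refs = refs - 1 WHERE snapshot = ?",
                    (snapshot,))
        row = con.execute("SELECT refs FROM refcounts WHERE snapshot = ?",
                          (snapshot,)).fetchone()
        if row and row[0] <= 0:
            con.execute("DELETE FROM refcounts WHERE snapshot = ?", (snapshot,))
            # ...only now is it safe to garbage-collect the snapshot bytes
```

Because artifacts land under their content hashes in phase one, a crash between the two phases leaves only unreferenced blobs, which a garbage collector can reclaim without risking a registered version.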
💡 Key Takeaways
Manifests record input data hashes or offsets, the code commit, the dependency lock with exact package versions, an environment fingerprint with OS and GPU driver versions, parameters, and output hashes for complete reproducibility
Content-addressed artifacts use cryptographic hashes as identifiers, providing idempotent uploads, natural deduplication across versions, and integrity verification without trusted storage
A two-phase commit protocol writes artifacts first and then atomically registers their manifests, ensuring consumers only see fully materialized versions with snapshot-isolation guarantees
Reference counting in metadata prevents deletion of base snapshots while dependent deltas remain in use, avoiding broken reconstruction chains during retention-policy enforcement
Prediction-time logging captures the model version, feature vector versions, and request identifiers, enabling forensic replay by walking lineage backward from serving through training to raw sources, as sketched below
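That walk-back can be sketched as a recursive traversal over the catalog. The catalog shape assumed here, a manifests index plus a reverse produced_by map from output hash to the run that produced it, is hypothetical:

```python
def walk_lineage(catalog: dict, manifest_id: str, depth: int = 0) -> None:
    """Print the chain of runs behind a serving-time model hash.
    `catalog` holds two indexes: manifests by id, and a reverse map
    from an output's content hash to the manifest id that produced it."""
    manifest = catalog["manifests"][manifest_id]
    print("  " * depth + f"{manifest_id} @ commit {manifest['code_commit']}")
    for input_hash in manifest["inputs"].values():
        producer = catalog["produced_by"].get(input_hash)
        if producer is not None:  # raw sources have no producing run
            walk_lineage(catalog, producer, depth + 1)
```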
📌 Examples
Netflix's Metaflow metadata service stores manifests linking each workflow step to code, parameters, inputs, and outputs; reproducing a training run six months later pulls the exact commits, pinned dependencies, and data hashes
Meta's FBLearner Flow produces immutable artifacts with manifest metadata connecting pipeline executions; reproducibility holds across large-scale experimentation because code, data, and hardware configurations are all versioned
A manifest for a training run includes inputs data_v1:sha256:abc123 and features_v2:sha256:def456, code commit 7a3f2b9, the requirements.lock hash, Ubuntu 22.04 with CUDA 12.1, and output model:sha256:model123 with metrics (rendered as JSON below)
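Rendered as JSON with illustrative field names, that last manifest might look like this; the lock-file hash and metric values were not specified above, so placeholders stand in:

```json
{
  "inputs": {
    "data_v1": "sha256:abc123",
    "features_v2": "sha256:def456"
  },
  "code_commit": "7a3f2b9",
  "dependency_lock": "sha256:...",
  "environment": { "os": "Ubuntu 22.04", "cuda": "12.1" },
  "outputs": {
    "model": "sha256:model123",
    "metrics": { "...": "..." }
  }
}
```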