
Production Manifests: Linking Data, Code, and Environment

A production-ready versioning system treats all machine learning artifacts as immutable, content-addressed assets connected by manifests. Every pipeline run persists a manifest recording input data versions as cryptographic hashes or stream offsets, the code commit hash, the dependency lock file with exact package versions, an environment fingerprint capturing operating system and GPU driver versions, execution parameters, and output artifact hashes. This single document enables complete reproducibility months later.

The manifest structure creates a chain of custody from raw data through transformations to trained models and predictions. When a model serves a prediction, logging metadata including the timestamp, model version hash, feature vector version, and request identifier allows forensic replay. If accuracy degrades, you walk the lineage graph backward through the model manifest to the training data versions, then to the feature pipeline code and raw source offsets, pinpointing exactly what changed.

Implementation follows a two-phase commit protocol: first, write data and artifacts to storage; second, atomically register them in the catalog with manifest metadata. Lineage events are emitted only after registration succeeds, ensuring consumers only see committed, fully materialized versions with snapshot isolation. For multi-table writes, a logical transaction identifier ties all outputs together. Reference counting in metadata prevents premature deletion of base snapshots that still have dependent deltas in use, which would break reconstruction chains.
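As a concrete illustration, here is a minimal Python sketch of manifest construction. The helper names (sha256_of, build_manifest, manifest_id) and the field layout are illustrative, not any particular tool's API:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Content-address an artifact by hashing its bytes."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return f"sha256:{h.hexdigest()}"

def build_manifest(inputs, code_commit, lock_file, env, params, outputs):
    """Assemble a run manifest: every field is a hash or a pinned value."""
    return {
        "inputs": {name: sha256_of(p) for name, p in inputs.items()},
        "code_commit": code_commit,               # e.g. output of `git rev-parse HEAD`
        "dependency_lock": sha256_of(lock_file),  # hash of the lock file itself
        "environment": env,                       # OS, GPU driver / CUDA versions
        "parameters": params,
        "outputs": {name: sha256_of(p) for name, p in outputs.items()},
    }

def manifest_id(manifest: dict) -> str:
    """The manifest is itself content-addressed via its canonical JSON."""
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return f"sha256:{hashlib.sha256(canonical).hexdigest()}"
```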
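The write-then-register protocol can be sketched with SQLite standing in for the catalog. The schema and the commit_run and release_snapshot helpers below are assumptions for illustration; the transactional pattern is the point: manifest registration and reference-count updates succeed or fail together, and lineage events fire only afterward:

```python
import json
import sqlite3

def open_catalog(db_path: str = "catalog.db") -> sqlite3.Connection:
    """Open the stand-in catalog, creating its tables if needed."""
    con = sqlite3.connect(db_path)
    con.executescript("""
        CREATE TABLE IF NOT EXISTS manifests (id TEXT PRIMARY KEY, body TEXT);
        CREATE TABLE IF NOT EXISTS refcounts (snapshot TEXT PRIMARY KEY, refs INTEGER);
    """)
    return con

def commit_run(con, manifest_id, manifest, base_snapshots, emit_lineage_event):
    """Phase two of the protocol. Phase one already wrote the artifacts to
    object storage under their content hashes (idempotent by construction)."""
    with con:  # sqlite3 transaction: commit on success, rollback on error
        con.execute("INSERT INTO manifests VALUES (?, ?)",
                    (manifest_id, json.dumps(manifest)))
        for snap in base_snapshots:
            con.execute(
                "INSERT INTO refcounts VALUES (?, 1) "
                "ON CONFLICT(snapshot) DO UPDATE SET refs = refs + 1",
                (snap,))
    # Lineage events are emitted only after registration succeeds.
    emit_lineage_event({"manifest_id": manifest_id, "inputs": manifest["inputs"]})

def release_snapshot(con, snapshot):
    """Retention enforcement: a base snapshot is reclaimed only once no
    dependent delta still references it."""
    with con:
        con.execute("UPDATE refcounts SET refs = refs - 1 WHERE snapshot = ?",
                    (snapshot,))
        row = con.execute("SELECT refs FROM refcounts WHERE snapshot = ?",
                          (snapshot,)).fetchone()
        if row and row[0] <= 0:
            con.execute("DELETE FROM refcounts WHERE snapshot = ?", (snapshot,))
            # ...only now is it safe to garbage-collect the snapshot bytes
```

Because artifacts land under their content hashes in phase one, a crash between the two phases leaves only unreferenced blobs, which a garbage collector can reclaim without risking a registered version.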
💡 Key Takeaways
Manifests record input data hashes or offsets, the code commit, the dependency lock with exact package versions, an environment fingerprint with OS and GPU driver versions, parameters, and output hashes for complete reproducibility
Content-addressed artifacts use cryptographic hashes as identifiers, providing idempotent uploads, natural deduplication across versions, and integrity verification without trusted storage
A two-phase commit protocol writes artifacts first and then atomically registers their manifests, ensuring consumers only see fully materialized versions with snapshot-isolation guarantees
Reference counting in metadata prevents deletion of base snapshots while dependent deltas remain in use, avoiding broken reconstruction chains during retention-policy enforcement
Prediction-time logging captures the model version, feature vector versions, and request identifiers, enabling forensic replay by walking lineage backward from serving through training to raw sources, as sketched below
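That walk-back can be sketched as a recursive traversal over the catalog. The catalog shape assumed here, a manifests index plus a reverse produced_by map from output hash to the run that produced it, is hypothetical:

```python
def walk_lineage(catalog: dict, manifest_id: str, depth: int = 0) -> None:
    """Print the chain of runs behind a serving-time model hash.
    `catalog` holds two indexes: manifests by id, and a reverse map
    from an output's content hash to the manifest id that produced it."""
    manifest = catalog["manifests"][manifest_id]
    print("  " * depth + f"{manifest_id} @ commit {manifest['code_commit']}")
    for input_hash in manifest["inputs"].values():
        producer = catalog["produced_by"].get(input_hash)
        if producer is not None:  # raw sources have no producing run
            walk_lineage(catalog, producer, depth + 1)
```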
📌 Examples
Netflix's Metaflow metadata service stores manifests linking each workflow step to code, parameters, inputs, and outputs; reproducing a training run six months later pulls the exact commits, pinned dependencies, and data hashes
Meta's FBLearner Flow produces immutable artifacts with manifest metadata connecting pipeline executions; reproducibility holds across large-scale experimentation because code, data, and hardware configurations are all versioned
A manifest for a training run includes inputs data_v1:sha256:abc123 and features_v2:sha256:def456, code commit 7a3f2b9, the requirements.lock hash, Ubuntu 22.04 with CUDA 12.1, and output model:sha256:model123 with metrics (rendered as JSON below)
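Rendered as JSON with illustrative field names, that last manifest might look like this; the lock-file hash and metric values were not specified above, so placeholders stand in:

```json
{
  "inputs": {
    "data_v1": "sha256:abc123",
    "features_v2": "sha256:def456"
  },
  "code_commit": "7a3f2b9",
  "dependency_lock": "sha256:...",
  "environment": { "os": "Ubuntu 22.04", "cuda": "12.1" },
  "outputs": {
    "model": "sha256:model123",
    "metrics": { "...": "..." }
  }
}
```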