
What is Data Versioning in Machine Learning Pipelines?

Data versioning is the practice of making datasets immutable and addressable so that any data used in machine learning can be referenced precisely and reproduced later. Instead of overwriting files or tables, every change creates a new version with a unique identifier, whether that is a cryptographic hash, a timestamp, or an offset range in a stream. Think of it like Git for data.

When you train a model on a dataset, you want to know exactly which rows were included and which transformations were applied, and to be able to recreate that exact state months later for debugging or retraining. Without versioning, data silently changes underneath you: a table gets updated, features are recalculated, or upstream sources evolve, and suddenly your experiment results are no longer reproducible.

Production systems mix several strategies. Full snapshots copy the entire dataset per version, giving fast reads but expensive storage. Delta-based approaches store only the changes from a base version, reducing storage costs six- to seven-fold in typical scenarios but requiring rehydration when reading old versions. Append-only offsets define versions by stream positions, which is ideal for event logs where you materialize a dataset by reading up to a specific offset or timestamp. The key principle is immutability: once written, a version never changes, so references remain valid forever.
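As a minimal illustration of the content-addressed flavor, the sketch below hashes a dataset file and records an immutable version manifest under that hash. It assumes a simple local file store and hypothetical names (publish_version, data_store) rather than any specific tool:

```python
import hashlib
import json
import shutil
from pathlib import Path

STORE = Path("data_store")  # hypothetical local content-addressed store


def publish_version(dataset_path: str, transform: str) -> str:
    """Copy a dataset into the store under its content hash and write a manifest."""
    data = Path(dataset_path).read_bytes()
    version_id = hashlib.sha256(data).hexdigest()[:16]  # content-addressed identifier

    version_dir = STORE / version_id
    if version_dir.exists():
        return version_id  # identical content deduplicates to the same version

    version_dir.mkdir(parents=True)
    shutil.copy(dataset_path, version_dir / Path(dataset_path).name)
    manifest = {
        "version_id": version_id,
        "source": dataset_path,
        "transform": transform,  # record lineage alongside the data
    }
    (version_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return version_id  # stable reference: the content under this id never changes


# Example (hypothetical file name): pin this id in the training run's metadata
# vid = publish_version("features_2024_06_01.parquet", transform="normalize_v3")
```

Because the identifier is derived from the bytes themselves, publishing identical content twice yields the same version, which is the deduplication property described in the takeaways below.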
💡 Key Takeaways
Immutability is fundamental: versions never change after creation, making all references stable and reproducible across time
Full snapshots provide instant access to complete datasets but require about 300 terabytes of storage per month for a 10 terabyte dataset with daily versions and 30 day retention
Delta-based storage cuts that to roughly 44.6 terabytes per month by keeping weekly checkpoints plus daily changes, saving about 85% while keeping reconstruction time to minutes (a worked calculation follows this list)
Append-only stream offsets define versions by position in event logs, enabling time travel by replaying up to specific offsets or watermarks without storing redundant copies
Content-addressed artifacts use cryptographic hashes as identifiers, providing natural deduplication and integrity verification across distributed storage systems
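To make the snapshot-versus-delta arithmetic concrete, here is a short worked calculation. The $23 per terabyte-month price is an assumption (typical object-storage pricing) chosen to be consistent with the dollar figures in the examples below, and the exact retained total depends on how checkpoint boundaries align with the 30-day window:

```python
# Worked storage math for a 10 TB dataset, 1% daily churn, 30-day retention.
DATASET_TB = 10
DAILY_DELTA_TB = DATASET_TB * 0.01          # 1% churn -> 100 GB of changes per day
RETENTION_DAYS = 30
PRICE_PER_TB_MONTH = 23.0                    # assumed ~$0.023/GB-month object storage

# Strategy 1: a full snapshot every day
full_snapshot_tb = DATASET_TB * RETENTION_DAYS              # 300 TB retained
full_snapshot_cost = full_snapshot_tb * PRICE_PER_TB_MONTH  # ~$6,900 per month

# Strategy 2: a weekly full checkpoint plus six daily deltas
weeks_retained = RETENTION_DAYS / 7                              # ~4.3 checkpoints
delta_tb = weeks_retained * (DATASET_TB + 6 * DAILY_DELTA_TB)    # ~45 TB retained,
delta_cost = delta_tb * PRICE_PER_TB_MONTH                       # close to the ~44.6 TB
                                                                 # quoted above

print(f"daily full snapshots: {full_snapshot_tb:.1f} TB, ${full_snapshot_cost:,.0f}/month")
print(f"weekly checkpoint + deltas: {delta_tb:.1f} TB, ${delta_cost:,.0f}/month")
print(f"savings: {1 - delta_cost / full_snapshot_cost:.0%}")  # ~85%
```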
📌 Examples
Netflix Metaflow captures immutable artifact versions with content addressing for every workflow step, recording inputs, code, parameters, and outputs for exact reproducibility
Kafka-based pipelines define dataset versions by bounded offsets across partitions, such as all events up to offset 5,000,000 in partitions 0 through 127, checkpointing weekly to cap read time (see the sketch after this list)
A 10 terabyte dataset with 1% daily churn generates 100 gigabytes of changes per day; storing weekly full snapshots plus six daily deltas costs $1,026 per month versus $6,900 for 30 daily full snapshots
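A minimal sketch of the offset-bounded pattern, assuming the kafka-python client, a broker at localhost:9092, a topic named "events", and a single partition for brevity (the production case above spans partitions 0 through 127):

```python
from kafka import KafkaConsumer, TopicPartition  # assumes kafka-python is installed

# A dataset "version" is just a frozen upper bound on offsets, e.g. from a manifest.
VERSION_MANIFEST = {"topic": "events", "partition": 0, "end_offset": 5_000_000}

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    enable_auto_commit=False,            # reading a version has no side effects
    consumer_timeout_ms=10_000,          # stop iterating if the log is exhausted
)
tp = TopicPartition(VERSION_MANIFEST["topic"], VERSION_MANIFEST["partition"])
consumer.assign([tp])
consumer.seek(tp, 0)  # replay from the beginning of the partition

records = []
for message in consumer:
    if message.offset >= VERSION_MANIFEST["end_offset"]:
        break  # everything past the boundary belongs to a later version
    records.append(message.value)

consumer.close()
# `records` now materializes exactly the dataset version defined by the manifest,
# no matter how many new events have been appended since.
```

In practice the end offsets for all partitions would live in the version manifest, and a periodic checkpoint (as in the cost example above) caps how far back the replay has to go.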