
Failure Modes: When Versioning and Lineage Break Down

Production versioning systems fail in predictable ways that require specific mitigations.

Broken delta chains occur when retention policies delete base snapshots before their dependent deltas are compacted, rendering entire version ranges unrecoverable. Metadata must reference-count active dependents and block deletion of any base with live dependencies until compaction merges its deltas into a new self-contained checkpoint.

Event-time versus processing-time confusion silently corrupts datasets. If stream versions are defined by processing time, when data happens to arrive, late-arriving events are assigned to the wrong versions, causing data leakage in training and double counting in metrics. Always define versions by event time with explicit watermarks and late-data windows, and document the allowed-lateness budget in the version metadata. For a payment processing pipeline that accepts events up to 24 hours late, version boundaries shift as late data arrives, so downstream consumers must handle those shifts deliberately.

Training-serving skew hides in lineage gaps. If online features bypass the transformation code logged for the offline pipeline, lineage appears clean while reality diverges. A hand-coded aggregation in a serving microservice that differs from the Spark job computing training features can cause a 10 to 20% accuracy drop that manifests only in production. Enforce single-source-of-truth transformations and dual-write lineage from both the offline batch and online streaming paths, alerting when transformation logic diverges.
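As a minimal sketch of the reference-counting guard described above, the snippet below tracks which delta versions still depend on each base snapshot and refuses deletion until compaction releases them. The class and method names (SnapshotCatalog, request_deletion, compact) are illustrative, not any particular system's API.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotCatalog:
    """Tracks which delta versions still depend on each base snapshot."""
    # base_id -> set of delta version ids that need the base for reconstruction
    dependents: dict = field(default_factory=dict)

    def register_delta(self, base_id: str, delta_id: str) -> None:
        self.dependents.setdefault(base_id, set()).add(delta_id)

    def request_deletion(self, base_id: str) -> bool:
        """Retention policy calls this; deletion is refused while live deltas remain."""
        if self.dependents.get(base_id):
            return False  # blocked: dependent versions would become unrecoverable
        return True       # safe to delete

    def compact(self, base_id: str, delta_ids: set, new_base_id: str) -> None:
        """After merging deltas into a new self-contained checkpoint, release the old base."""
        self.dependents[base_id] -= delta_ids
        self.dependents.setdefault(new_base_id, set())


# Usage: the retention job must check request_deletion() before removing a base.
catalog = SnapshotCatalog()
catalog.register_delta("base_2024_w01", "delta_day_91")
assert catalog.request_deletion("base_2024_w01") is False  # blocked until compaction
catalog.compact("base_2024_w01", {"delta_day_91"}, "base_2024_w14")
assert catalog.request_deletion("base_2024_w01") is True
```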
💡 Key Takeaways
Broken delta chains from premature base deletion render versions unrecoverable; reference counting must track dependents and block deletion of bases with active deltas until compaction completes
Event-time versus processing-time confusion admits late events to the wrong versions, causing silent data leakage; always define versions by event time with explicit watermarks and document lateness budgets (see the event-time sketch after this list)
Training-serving skew hidden by lineage gaps occurs when online features bypass offline transformations; hand-coded service logic diverging from training pipelines causes 10 to 20% accuracy drops
Schema evolution fragments lineage when renames or splits lack semantic-equivalence tracking; maintain explicit schema-change events linking old and new columns to preserve continuity
Non-deterministic transformations with unseeded randomness, parallel-aggregation order sensitivity, or time-of-day dependencies break reproducibility even with versioned inputs; fix seeds and record external parameters (seed handling is sketched after this list)
GDPR deletion requests conflict with immutable versions that retain personally identifiable information; solutions include per-subject encryption keys, detachable joinable tables, or write-time redaction with salted hashes
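A hedged sketch of the event-time versioning rule from the takeaways: records are assigned to a daily version by when they occurred, and anything beyond the documented lateness budget is quarantined instead of silently landing in an old version. The function name, version-key format, and the 24-hour budget (borrowed from the examples below) are assumptions, not a specific streaming framework's API.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_LATENESS = timedelta(hours=24)  # documented in the version metadata


def assign_version(event_time: datetime, processing_time: datetime) -> str | None:
    """Assign a record to a daily version by event time, not arrival time."""
    if processing_time - event_time > ALLOWED_LATENESS:
        return None  # too late: quarantine rather than corrupt an old version
    return f"v_{event_time.strftime('%Y%m%d')}"  # daily version keyed by event time


# A transaction that occurred at 23:50 UTC but arrived 6 hours later still
# belongs to the day it happened, avoiding the leakage described below.
evt = datetime(2024, 3, 1, 23, 50, tzinfo=timezone.utc)
arr = datetime(2024, 3, 2, 5, 50, tzinfo=timezone.utc)
print(assign_version(evt, arr))  # -> "v_20240301"
```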
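And a small illustration of removing the non-determinism sources named above: seed the RNG explicitly, sort inputs before order-sensitive processing, and pass the "current time" in as a recorded parameter rather than reading the clock. All names and field layouts here are hypothetical.

```python
import hashlib
import json
import random
from datetime import datetime


def transform(records: list[dict], seed: int, as_of: datetime) -> list[dict]:
    """Deterministic sampling: seeded RNG, sorted inputs, explicit 'as of' time."""
    rng = random.Random(seed)                         # never the global, unseeded RNG
    ordered = sorted(records, key=lambda r: r["id"])  # fix ingest/parallel order sensitivity
    sampled = [r for r in ordered if rng.random() < 0.5]
    cutoff = as_of.timestamp()                        # recorded parameter, not time.time()
    return [r for r in sampled if r["ts"] <= cutoff]


# Record the external parameters in the version metadata so lineage can replay the run.
params = {"seed": 42, "as_of": "2024-03-01T00:00:00+00:00"}
version_metadata = {
    "transform_params": params,
    "params_digest": hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest(),
}
```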
📌 Examples
A retention policy deletes a weekly checkpoint after 90 days while daily deltas from days 91 to 97 still depend on it; reconstruction fails because the base is missing; reference counting would have blocked deletion until the deltas were compacted
Payment-pipeline versions defined by processing time place transactions arriving 6 hours late in the wrong day's dataset, causing 2% training-data leakage and inflated revenue metrics until the pipeline switches to event time with a 24-hour watermark
A recommendation model drops 15% accuracy in production because the online feature service computes user engagement as a 7-day rolling average while training used a 30-day aggregation; dual lineage tracking would surface the divergence
A user deletion request under GDPR cannot erase data from immutable historical snapshots; the system switches to per-user encryption keys stored separately, deleting a key to cryptographically erase access without modifying the snapshots (sketched below)
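A sketch of the crypto-erasure pattern from the last example, assuming the `cryptography` package's Fernet for symmetric encryption; the in-memory key store stands in for whatever separate key service the system actually uses.

```python
from cryptography.fernet import Fernet

# Keys live outside the immutable snapshots; one key per data subject.
key_store: dict[str, bytes] = {}


def encrypt_user_record(user_id: str, payload: bytes) -> bytes:
    """Encrypt a record before it is written into an immutable snapshot."""
    key = key_store.setdefault(user_id, Fernet.generate_key())
    return Fernet(key).encrypt(payload)


def erase_user(user_id: str) -> None:
    """GDPR deletion: drop the key; ciphertext in old snapshots becomes unreadable."""
    key_store.pop(user_id, None)


token = encrypt_user_record("user_42", b'{"email": "a@example.com"}')
erase_user("user_42")
# The snapshots are untouched, but `token` can no longer be decrypted for user_42.
```

Deleting the key is a constant-time operation that leaves every historical snapshot byte-for-byte unchanged, which is what makes this approach compatible with immutable versioning.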