
Failure Modes and Edge Cases in Production Reproducibility

Training-serving skew occurs when a model is trained on batch features but served with real-time features, causing accuracy drops of up to 20 percent. A ranking model trained on yesterday's aggregated user statistics but served with current-session features sees a distribution shift at inference time.

Data time-travel gaps happen when source systems overwrite or delete data, so reproduced runs silently read different inputs: an append-only raw table gets compacted, or a 90-day retention policy deletes the training window you need to replay. Fix with append-only storage, snapshotting, and dataset fingerprints based on content hashes or transaction IDs that fail loudly when data is missing.

Feature skew between offline and online systems is subtler. Training might compute a 30-day click-through rate using exact timestamps, while serving uses a cached daily aggregate that updates at midnight; the model learns on precise signals but predicts with stale approximations. Airbnb's Zipline addresses this by version-controlling feature definitions and materializations, ensuring training and serving read from the same logical feature pipeline.

Partial logging or missing lineage happens when teams forget to log configs or datasets, so models enter the registry without provenance. Fix with pipeline gates that block artifact registration unless mandatory fields such as the code commit, dataset snapshot ID, and environment digest are present.

Personally identifiable information (PII) leakage in metadata occurs when teams log configurations or parameters that include secrets or PII; an experiment config might accidentally capture a database connection string or customer IDs. Fix with client-side redaction, allowlisted fields, and automated detectors that scan metadata for patterns like credit card or social security numbers. Encrypt sensitive fields at rest.

Statistical overfitting to the test set inflates Type I error when many runs iterate against the same validation or test slice. After evaluating 100 hyperparameter combinations, the best one has likely overfit to noise. Fix with pre-registered evaluation protocols, multiple repeats for top candidates, and holdback test sets accessed only for final contenders.

Metadata store hot spots appear during hyperparameter optimization bursts. A team launching 5,000 short-lived runs in an hour, each writing 10 events, generates 50,000 writes, roughly 14 writes per second sustained with bursts up to 100 per second. If the metadata database is not partitioned or write-optimized, this causes slowdowns or failures. Fix with write-optimized append-only event logs, eventually consistent materialized views, and time-based or project-based partitioning.

Artifact retention blowups occur when every checkpoint is stored: a 200 MB model saved every epoch for 100 epochs across 500 runs per day adds up to 10 TB daily. Use lifecycle policies that keep everything for 30 days, then retain only the top k checkpoints per experiment, and deduplicate identical artifacts.
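To make the "fail loudly" behavior concrete, here is a minimal sketch of dataset fingerprinting, assuming the training snapshot is a set of local files; the fingerprint_dataset and verify_fingerprint helpers, the manifest layout, and the snapshot path are illustrative assumptions, not any specific framework's API.

```python
import hashlib
import json
from pathlib import Path

def fingerprint_dataset(files: list[Path]) -> dict:
    """Content-hash every file in the training snapshot into a manifest."""
    manifest = {}
    for path in sorted(files):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        manifest[str(path)] = {"sha256": digest, "bytes": path.stat().st_size}
    return manifest

def verify_fingerprint(manifest: dict) -> None:
    """Fail loudly if any input file is missing or its content has changed."""
    for name, expected in manifest.items():
        path = Path(name)
        if not path.exists():
            raise RuntimeError(f"dataset drift: {name} is missing (retention or compaction?)")
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest != expected["sha256"]:
            raise RuntimeError(f"dataset drift: {name} content changed since training")

# At training time: log the manifest alongside the run (path is hypothetical).
manifest = fingerprint_dataset(list(Path("data/snapshots/2024-03-31").glob("*.parquet")))
Path("run_manifest.json").write_text(json.dumps(manifest, indent=2))

# At reproduction time: refuse to train on silently different inputs.
verify_fingerprint(json.loads(Path("run_manifest.json").read_text()))
```

For datasets too large to rehash, the same idea works with per-partition digests or warehouse transaction IDs recorded in the manifest. In the same spirit, a registration gate can refuse any artifact whose lineage is incomplete; the field names below mirror the ones listed above, while the function and metadata shape are assumptions rather than a particular registry's interface.

```python
REQUIRED_PROVENANCE = ("code_commit", "dataset_snapshot_id", "environment_digest")

def gate_registration(run_metadata: dict) -> None:
    """Block registry entry for models without full provenance."""
    missing = [field for field in REQUIRED_PROVENANCE if not run_metadata.get(field)]
    if missing:
        raise ValueError(f"registration blocked, missing provenance fields: {missing}")

# Passes the gate; drop any field and registration fails loudly.
gate_registration({
    "code_commit": "9f2c1ab",
    "dataset_snapshot_id": "snap-2024-03-31-0001",
    "environment_digest": "sha256:4b3e7d10",
})
```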
💡 Key Takeaways
Training-serving skew from batch versus real-time features causes up to a 20 percent accuracy drop; a model trained on yesterday's aggregates but served with current-session features sees distribution shift
Data time-travel gaps when source systems delete data within the lookback window; reproduced runs silently read different inputs; fix with append-only tables and dataset fingerprints that fail loudly when data is missing
Feature skew: training computes a 30-day click-through rate with exact timestamps, serving uses a cached daily aggregate that updates at midnight; Airbnb's Zipline versions feature definitions to ensure consistency
PII leakage when logging configs with secrets or customer IDs; fix with client-side redaction, allowlisted fields, and automated pattern detectors for credit card or social security numbers (see the redaction sketch after this list)
Statistical overfitting: after 100 hyperparameter evaluations on the same test set, the best candidate has likely overfit to noise; fix with pre-registered protocols, multiple repeats, and holdback sets for final contenders
Metadata hot spots: 5,000 runs in 1 hour with 10 events each creates roughly 14 writes per second sustained, with bursts to 100 per second; fix with append-only event logs and partitioning by time or project
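A minimal sketch of the client-side redaction described above: an allowlist of loggable config fields plus regex detectors for common PII patterns. The field names, patterns, and redact_config helper are illustrative assumptions, not part of any particular tracking client.

```python
import re

# Hypothetical allowlist: only these config fields may be logged verbatim.
ALLOWED_FIELDS = {"learning_rate", "batch_size", "optimizer", "num_epochs", "model_arch"}

# Simple pattern detectors for common PII and secrets (illustrative, not exhaustive).
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                     # US social security number
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),                    # credit-card-like digit runs
    re.compile(r"(?i)(password|secret|token)\s*[=:]\s*\S+"),  # embedded credentials
]

def redact_config(config: dict) -> dict:
    """Return a copy of the experiment config that is safe to send to the tracking server."""
    safe = {}
    for key, value in config.items():
        if key not in ALLOWED_FIELDS:
            safe[key] = "<REDACTED: field not allowlisted>"
        elif any(p.search(str(value)) for p in PII_PATTERNS):
            safe[key] = "<REDACTED: matched PII pattern>"
        else:
            safe[key] = value
    return safe

# The connection string and customer ID never leave the client.
print(redact_config({
    "learning_rate": 3e-4,
    "batch_size": 512,
    "db_uri": "postgres://admin:hunter2@prod-db/users",
    "customer_id": "C-982341",
}))
```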
📌 Examples
Ranking model on Zipline batch features: training on a 90-day window with exact timestamps while serving with cached hourly features caused a 15% precision drop until switching to consistent feature versions
Data deletion failure: a training window from 2024-01-01 to 2024-03-31 was reproduced 6 months later, after a 90-day retention policy had deleted the January data; dataset fingerprint verification caught the missing 33% of rows
Artifact retention explosion: 200 MB checkpoint × 100 epochs × 500 runs/day = 10 TB/day; a lifecycle policy keeps everything for 30 days (300 TB), then retaining only the top 10 per experiment with deduplication cuts this to roughly 30 TB total (see the sketch after these examples)
Test set overfitting: a team evaluated 150 hyperparameter combinations on the same validation set; the top candidate showed a 2% improvement but regressed 0.5% on the holdback test set
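A back-of-the-envelope sketch of the retention arithmetic in the example above; the checkpoint size, run counts, and top-k policy come from the example, and the decimal unit conversion is the only added assumption.

```python
# Back-of-the-envelope retention math (decimal units: 1 TB = 1,000,000 MB).
MB = 1
TB = 1_000_000

checkpoint_mb  = 200 * MB   # one saved checkpoint
epochs_per_run = 100        # one checkpoint per epoch
runs_per_day   = 500

daily_mb = checkpoint_mb * epochs_per_run * runs_per_day
print(f"raw volume: {daily_mb / TB:.0f} TB/day")                        # -> 10 TB/day

keep_all_days = 30
hot_mb = daily_mb * keep_all_days
print(f"keep-all window ({keep_all_days} days): {hot_mb / TB:.0f} TB")  # -> 300 TB

# After the keep-all window, retain only the top-k checkpoints per run.
top_k = 10
cold_mb = hot_mb * top_k / epochs_per_run                               # 10 of 100 checkpoints
print(f"after top-{top_k} pruning: {cold_mb / TB:.0f} TB")              # -> 30 TB
# Deduplicating byte-identical artifacts shrinks the footprint further.
```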