Training Infrastructure & Pipelines › Experiment Tracking & Reproducibility · Hard · ⏱️ ~3 min

Failure Modes and Edge Cases in Production Reproducibility

Training Serving Skew

Training-serving skew occurs when models are trained on batch features but served with real-time features, causing accuracy drops of up to 20 percent. A ranking model trained on yesterday's aggregated user statistics but served with current-session features sees a distribution shift. Data time-travel gaps happen when source systems overwrite or delete data, so reproduced runs silently read different inputs: an append-only raw table gets compacted, or a 90-day retention policy deletes the training window you need to replay. Fix with append-only storage, snapshotting, and dataset fingerprints (content hashes) that fail loudly when data is missing.
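The fail-loudly part of that fix can be sketched as a content-hash fingerprint. This is a minimal illustration, not any particular tool's API; the function names are hypothetical:

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(paths):
    """Hash file contents (not names or mtimes), so a silent rewrite or
    compaction of the source data changes the fingerprint."""
    digest = hashlib.sha256()
    for path in sorted(paths):  # stable order so the hash is deterministic
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()

def verify_or_fail(paths, expected_fingerprint):
    """Fail loudly when replayed inputs differ from the training-time snapshot."""
    actual = dataset_fingerprint(paths)
    if actual != expected_fingerprint:
        raise RuntimeError(
            f"Dataset drift: expected {expected_fingerprint[:12]}, got {actual[:12]}"
        )
```

A training run records the fingerprint alongside its metadata; a reproduction run calls `verify_or_fail` before reading a single row, turning a silent wrong answer into an immediate error.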

Feature Skew

Feature skew between offline and online systems is subtle. Training might compute a 30-day click-through rate using exact timestamps, while serving uses a cached daily aggregate that updates at midnight; the model learns on precise signals but predicts with stale approximations. Uber's Zipline addresses this by version-controlling feature definitions and materializations, ensuring training and serving read from the same logical feature pipeline. Partial logging or missing lineage happens when teams forget to log configs or datasets, so models enter the registry without provenance. Fix with pipeline gates that block artifact registration unless mandatory fields are present.
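A registration gate of that kind is a few lines of validation in front of the registry write. The field names below are hypothetical placeholders for whatever provenance your pipeline records:

```python
# Hypothetical mandatory provenance fields; substitute your pipeline's own.
MANDATORY_FIELDS = {"dataset_fingerprint", "git_commit", "config_uri", "feature_version"}

def gate_registration(metadata: dict) -> None:
    """Block registry writes unless every provenance field is present and non-empty."""
    missing = {f for f in MANDATORY_FIELDS if not metadata.get(f)}
    if missing:
        raise ValueError(f"Registration blocked; missing provenance: {sorted(missing)}")
```

Running the gate server-side (inside the registry service) rather than in client code means no team can skip it by forgetting a logging call.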

PII Leakage and Security

PII leakage in metadata occurs when teams log configurations or parameters that include secrets or personally identifiable information: an experiment config might accidentally capture a database connection string or customer IDs. Fix with client-side redaction, allowlisted fields, and automated detectors that scan metadata for patterns such as credit card numbers. Encrypt sensitive fields at rest.
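Combining the allowlist with pattern detection might look like the sketch below. The allowlisted keys and regexes are illustrative assumptions; production scanners use richer detectors (e.g. Luhn validation for card numbers):

```python
import re

# Hypothetical allowlist of keys safe to log verbatim.
ALLOWLIST = {"learning_rate", "batch_size", "optimizer", "epochs"}

# Simple illustrative detectors; real scanners catch far more.
PATTERNS = [
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),           # credit-card-like digit runs
    re.compile(r"(?i)(password|secret|token)=\S+"),  # inline credentials
]

def redact_config(config: dict) -> dict:
    """Client-side redaction: drop non-allowlisted keys to a placeholder,
    and mask allowlisted values that still match a sensitive pattern."""
    clean = {}
    for key, value in config.items():
        if key not in ALLOWLIST or any(p.search(str(value)) for p in PATTERNS):
            clean[key] = "[REDACTED]"
        else:
            clean[key] = value
    return clean
```

Redacting on the client, before the metadata leaves the process, matters: once a connection string reaches the tracking server it is already persisted in logs and backups.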

Statistical Overfitting and Infrastructure Issues

Statistical overfitting to the test set inflates Type I error when many runs iterate against the same validation slice: after evaluating 100 hyperparameter combinations, the best one has likely overfit to noise. Fix with pre-registered evaluation protocols, multiple repeats for top candidates, and holdback test sets accessed only for final contenders. Metadata-store hot spots happen during hyperparameter-optimization bursts. Fix with write-optimized append-only event logs, eventually consistent materialized views, and time-based partitioning.
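The selection effect is easy to demonstrate: even when every candidate has identical true accuracy, the best validation score among many candidates sits above the truth by pure noise. A minimal simulation (all numbers are illustrative):

```python
import random

def selection_gap(n_candidates=100, n_val=1000, seed=0):
    """Every candidate has true accuracy 0.70; validation scores differ only
    by sampling noise. Return how far the best observed score exceeds the
    truth -- that gap is exactly what overfits to the shared validation set."""
    rng = random.Random(seed)
    true_acc = 0.70
    best_val = max(
        sum(rng.random() < true_acc for _ in range(n_val)) / n_val
        for _ in range(n_candidates)
    )
    return best_val - true_acc
```

With 100 candidates on a 1,000-example slice, the winner typically looks a few percentage points better than it really is, which is why the "best" configuration so often regresses on a fresh holdback set.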

💡 Key Takeaways
Training-serving skew from batch vs. real-time features causes up to a 20 percent accuracy drop; a model trained on yesterday's aggregates but served with current-session features sees distribution shift
Data time-travel gaps occur when source systems delete within the lookback window; reproduced runs silently read different inputs; fix with append-only tables and dataset fingerprints that fail loudly when data is missing
Feature skew: training computes a 30-day click-through rate with exact timestamps while serving uses a cached daily aggregate updating at midnight; Uber's Zipline versions feature definitions to ensure consistency
PII leakage when logging configs with secrets or customer IDs; fix with client-side redaction, allowlisted fields, and automated pattern detectors for credit card or social security numbers
Statistical overfitting: after 100 hyperparameter evaluations on the same test set, the best candidate has likely overfit to noise; fix with pre-registered protocols, multiple repeats, and holdback sets for final contenders
Metadata hot spots: 5,000 runs in 1 hour with 10 events each is roughly 14 writes per second sustained, with bursts to 100 per second; fix with append-only logs and partitioning by time or project
📌 Interview Tips
1. Uber ranking model: training on a 90-day window of Zipline batch features with exact timestamps but serving with cached hourly features caused a 15% precision drop until switching to consistent feature versions
2. Data deletion failure: a training window from 2024-01-01 to 2024-03-31, reproduced 6 months later after a 90-day retention policy had deleted the January data; dataset fingerprint verification caught that 33% of rows were missing
3. Artifact retention explosion: 200 MB model × 100 epochs × 500 runs/day = 10 TB/day; keeping everything for 30 days costs 300 TB, while keeping only the top 10 checkpoints per experiment with deduplication reduces this to 30 TB total
4. Test set overfitting: a team evaluated 150 hyperparameter combinations on the same validation set; the top candidate showed a 2% improvement but regressed 0.5% on the holdback test set