
Logging and Measurement: Building Training Data from Production

Production Feature Logging

The second pillar of skew prevention is explicit measurement through production feature logging. The gold standard is logging the exact feature vector used for each prediction at serving time, then building your next training dataset directly from these logs. This eliminates entire classes of skew because training literally uses what production saw, including all the quirks: missing values from timeouts, stale cache entries, upstream service failures, and edge cases that never appear in clean offline pipelines.
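The serving-time logging pattern can be sketched as follows (a minimal sketch: the model object, the request-ID join key, and the JSON-lines sink are illustrative assumptions, not a specific production API; real systems typically write to a message bus rather than a file):

```python
import json
import time
import uuid

def predict_and_log(model, features: dict, log_sink):
    """Serve a prediction and log the exact feature vector the model saw.

    The logged record keeps the features as-is, including missing values
    from timeouts or stale caches, so the next training dataset reflects
    production reality. `log_sink` is any file-like object (assumption).
    """
    prediction = model.predict(features)
    record = {
        "request_id": str(uuid.uuid4()),  # join key for delayed labels
        "timestamp": time.time(),
        "features": features,             # exactly what was served on
        "prediction": prediction,
    }
    log_sink.write(json.dumps(record) + "\n")
    return prediction
```

Because the record carries a request ID, eventually observed labels can later be joined back onto these rows to close the loop from serving to training.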

Scale Considerations

At 20,000 QPS with 1 kilobyte per feature vector, you generate 20 megabytes per second of logs, which is 72 gigabytes per hour or about 1.7 terabytes per day. Real systems control this through sampling (logging 1 to 10 percent of predictions), compression (Protocol Buffers or Avro), and schema minimization. Google's TFX-style stacks log features, model outputs, and eventually observed labels, creating a closed loop where serving directly feeds training.
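One common way to implement the sampling step is to hash the request ID deterministically, so the 1-to-10 percent decision is reproducible across retries and replicas (a sketch under assumptions: the 5 percent default and the CRC32 hash are illustrative choices, not a prescribed scheme):

```python
import zlib

def should_log(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministic sampling: the same request ID always gets the same
    keep/drop decision, so retried requests are never double-logged.
    At 20,000 QPS and 1 KB per record (20 MB/s raw), a 5% rate cuts
    the uncompressed stream to roughly 1 MB/s before compression.
    """
    bucket = zlib.crc32(request_id.encode()) % 10_000
    return bucket < sample_rate * 10_000
```

Hashing rather than random sampling also keeps the logged slice stable for later debugging: the same request always lands in or out of the sample.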

Blocking Skew Tests

This enables powerful validation workflows. Before deploying a new model, you run blocking skew tests: take 1,000 to 10,000 examples from your offline pipeline with expected outputs, load your packaged model with bundled transforms, run both paths, and assert output deltas below tight thresholds (maximum absolute difference less than 0.000001 for deterministic models). If there is a mismatch, you binary search the stack to localize the divergence.
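The blocking check itself reduces to a comparison over paired outputs from the two paths (function name and return shape are illustrative):

```python
def blocking_skew_test(offline_outputs, serving_outputs, tol=1e-6):
    """Compare offline-pipeline outputs against the packaged serving
    model's outputs on the same examples. Returns (passed, worst_delta);
    deployment is blocked when any absolute delta exceeds `tol`.
    """
    assert len(offline_outputs) == len(serving_outputs), "example count mismatch"
    worst = max(abs(a - b) for a, b in zip(offline_outputs, serving_outputs))
    return worst <= tol, worst
```

Reporting the worst delta alongside the pass/fail flag gives the binary search a starting point when a mismatch does occur.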

Continuous Monitoring

Continuous monitoring extends this to live traffic. You compute distribution metrics between your training reference dataset and live serving slices: the Population Stability Index (PSI), which warns above 0.1 and alerts above 0.2; Kolmogorov-Smirnov statistics for continuous features; and Jensen-Shannon divergence for categorical distributions. When a fraud model's PSI spikes on device-fingerprint features, that triggers an investigation before false-positive rates climb.
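A minimal PSI computation over a single continuous feature might look like this (a sketch under assumptions: equal-width bins and the 1e-4 smoothing floor are simplifications; production systems typically derive quantile bins from the reference set):

```python
import math

def psi(reference, live, bins=10):
    """Population Stability Index between a training reference sample
    and a live serving slice. Bins span the combined value range;
    empty bins are floored at 1e-4 to avoid log(0).
    """
    lo = min(min(reference), min(live))
    hi = max(max(reference), max(live))
    width = (hi - lo) / bins or 1.0  # guard against constant data

    def fractions(data):
        counts = [0] * bins
        for x in data:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        n = len(data)
        return [max(c / n, 1e-4) for c in counts]

    ref_frac, live_frac = fractions(reference), fractions(live)
    return sum((lv - rf) * math.log(lv / rf)
               for rf, lv in zip(ref_frac, live_frac))
```

Against the thresholds above, a result under 0.1 is treated as stable, 0.1 to 0.2 raises a warning, and anything over 0.2 alerts.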

💡 Key Takeaways
- Gold standard: log exact feature vectors at serving time (sampled at 1 to 10 percent) and build the next training dataset from these production logs to eliminate offline/online pipeline divergence
- Scale and cost: 20,000 QPS at 1 kilobyte per record yields 72 gigabytes per hour uncompressed; control via sampling, Protocol Buffers compression, and schema minimization (feature IDs, not raw values)
- Blocking skew tests pre-deploy: run 1,000 to 10,000 examples through both the offline pipeline and the packaged serving model, asserting output deltas below 0.000001 absolute difference for deterministic models
- Continuous drift detection: Population Stability Index (PSI) warns above 0.1 and alerts above 0.2; Kolmogorov-Smirnov test for continuous features; Jensen-Shannon divergence for categorical distributions
- Trade-off: full logging provides the strongest parity but raises privacy concerns (Personally Identifiable Information, or PII, compliance) and storage costs; balance with sampling, hashing, and access controls
📌 Interview Tips
1. Google TensorFlow Extended (TFX): logs features, predictions, and delayed labels from serving; builds training data directly from logs; runs validation comparing offline versus online model outputs
2. Uber fraud detection: monitors PSI on device-fingerprint and transaction features; a PSI spike above 0.15 triggers investigation before false-positive rates increase, preventing customer friction
3. Meta feed ranking: pre-deployment tests run 5,000 examples comparing offline ranking scores against packaged model outputs; any delta above 0.00001 blocks deployment until resolved