
Logging and Measurement: Building Training Data from Production

The second pillar of skew prevention is explicit measurement through production feature logging. The gold standard is logging the exact feature vector used for each prediction at serving time, then building your next training dataset directly from these logs. This eliminates entire classes of skew because training literally uses what production saw, including all the quirks: missing values from timeouts, stale cache entries, upstream service failures, and edge cases that never appear in clean offline pipelines. At 20,000 queries per second (QPS) with 1 kilobyte per feature vector, you generate 20 megabytes per second of logs, which is 72 gigabytes per hour or roughly 1.7 terabytes per day. Real systems control this through sampling (log 1% to 10% of predictions), compact binary encoding (Protocol Buffers or Avro), and schema minimization (feature IDs and hashed values rather than raw strings). Google's TensorFlow Extended (TFX)-style stacks log features, model outputs, and eventually observed labels, creating a closed loop where serving directly feeds training.

This enables powerful validation workflows. Before deploying a new model, you run blocking skew tests: take 1,000 to 10,000 examples from your offline pipeline with expected outputs, load your packaged model with its bundled transforms, run both paths, and assert that the output deltas stay below tight thresholds (maximum absolute difference under 0.000001 for deterministic models). If there is a mismatch, you binary-search the stack (features, then transforms, then model) to localize the divergence. Meta runs these tests on every model deployment, preventing skewed models from reaching production.

Continuous monitoring extends this to live traffic. You compute distribution metrics between your training reference dataset and live serving slices: Population Stability Index (PSI), which warns above 0.1 and alerts above 0.2, the Kolmogorov–Smirnov (KS) statistic for continuous features, and Jensen–Shannon (JS) divergence for categorical distributions. When Uber's fraud model sees PSI spike on device fingerprint features, it triggers investigation before false positive rates climb. Netflix monitors not just feature distributions but model output calibration: if predicted click-through rate (CTR) distributions shift while actual CTR stays stable, serving behavior has diverged from training.
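To make the logging side concrete, here is a minimal sketch of sampled feature logging at serving time. The record fields, the 5% sample rate, and the names (PredictionLogRecord, maybe_log_prediction) are illustrative assumptions rather than any specific system's API; a real deployment would write a compact Protocol Buffers or Avro record instead of JSON.

```python
import json
import random
from dataclasses import dataclass, asdict

# Illustrative sample rate, within the 1%-10% range discussed above.
FEATURE_LOG_SAMPLE_RATE = 0.05


@dataclass
class PredictionLogRecord:
    """Hypothetical record of exactly what the model saw and returned."""
    request_id: str
    model_version: str
    timestamp_ms: int
    features: dict       # the exact feature vector used at serving time, quirks included
    prediction: float


def maybe_log_prediction(record: PredictionLogRecord, sink) -> None:
    """Write a sampled fraction of serving-time feature vectors to a log sink."""
    if random.random() < FEATURE_LOG_SAMPLE_RATE:
        # JSON keeps this sketch self-contained; production systems would use
        # a compact binary encoding with a minimized schema.
        sink.write(json.dumps(asdict(record)) + "\n")


# Usage right after the model produces a score (names are placeholders):
# maybe_log_prediction(
#     PredictionLogRecord(request_id, "v42", now_ms, features, score), log_file)
```

These logs, joined later with observed labels, become the raw material for the next training dataset.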
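The pre-deployment blocking skew test described above can be as small as the following sketch, assuming you can call both the offline pipeline and the packaged serving model on the same raw examples; offline_predict and serving_predict are hypothetical entry points you would supply.

```python
import numpy as np

# Tight threshold for deterministic models, matching the text.
MAX_ABS_DELTA = 1e-6


def blocking_skew_test(raw_examples, offline_predict, serving_predict) -> None:
    """Fail the deployment if offline and packaged-serving outputs diverge."""
    offline_scores = np.asarray([offline_predict(x) for x in raw_examples])
    serving_scores = np.asarray([serving_predict(x) for x in raw_examples])

    deltas = np.abs(offline_scores - serving_scores)
    worst = int(np.argmax(deltas))
    assert deltas.max() < MAX_ABS_DELTA, (
        f"Skew detected: example {worst} differs by {deltas[worst]:.2e} "
        f"(offline={offline_scores[worst]:.6f}, serving={serving_scores[worst]:.6f})"
    )


# Run this on 1,000-10,000 held-out examples in CI; any failure blocks the rollout,
# then binary-search features -> transforms -> model to localize the divergence.
```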
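For the continuous-monitoring side, here is a sketch of the Population Stability Index computation with the warn/alert thresholds mentioned above. The binning choices, epsilon handling, and helper names are illustrative; off-the-shelf monitoring tools wrap the same formula.

```python
import numpy as np


def population_stability_index(reference: np.ndarray, serving: np.ndarray,
                               bins: int = 10) -> float:
    """PSI = sum over bins of (serving% - reference%) * ln(serving% / reference%)."""
    # Quantile bin edges from the training reference so both distributions
    # are compared on the same grid. Serving values outside the reference
    # range are simply dropped in this sketch.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    ref_counts, _ = np.histogram(reference, bins=edges)
    srv_counts, _ = np.histogram(serving, bins=edges)

    # Small epsilon keeps empty bins from producing log(0) or division by zero.
    eps = 1e-6
    ref_pct = np.maximum(ref_counts / max(ref_counts.sum(), 1), eps)
    srv_pct = np.maximum(srv_counts / max(srv_counts.sum(), 1), eps)
    return float(np.sum((srv_pct - ref_pct) * np.log(srv_pct / ref_pct)))


def psi_status(psi: float) -> str:
    if psi > 0.2:
        return "ALERT"   # investigate immediately
    if psi > 0.1:
        return "WARN"    # distribution is shifting
    return "OK"


# Example: compare the last hour of a logged numeric feature against its training reference.
# print(psi_status(population_stability_index(train_feature_values, serving_feature_values)))
```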
💡 Key Takeaways
Gold standard: Log exact feature vectors at serving time (sampled at 1% to 10%), then build the next training dataset from these production logs to eliminate offline/online pipeline divergence
Scale and cost: 20,000 QPS at 1 kilobyte per record yields 72 gigabytes per hour uncompressed; control this via sampling, compact Protocol Buffers or Avro encoding, and schema minimization (feature IDs, not raw values)
Pre-deployment blocking skew tests: Run 1,000 to 10,000 examples through both the offline pipeline and the packaged serving model, and assert output deltas below 0.000001 absolute difference for deterministic models
Continuous drift detection: Population Stability Index (PSI) warns above 0.1 and alerts above 0.2; the Kolmogorov–Smirnov test covers continuous features and Jensen–Shannon divergence covers categorical distributions
Trade-off: Full logging provides the strongest parity but raises privacy concerns (Personally Identifiable Information, PII, compliance) and storage costs; balance it with sampling, hashing, and access controls
📌 Examples
Google TensorFlow Extended (TFX): Logs features, predictions, and delayed labels from serving, builds training data directly from logs, runs validation comparing offline versus online model outputs
Uber fraud detection: Monitors PSI on device fingerprint and transaction features; PSI spike above 0.15 triggers investigation before false positive rates increase, preventing customer friction
Meta feed ranking: Pre-deployment tests run 5,000 examples comparing offline ranking scores against packaged model outputs; any delta above 0.00001 blocks deployment until resolved