Feature Pipeline Architecture and Operational Patterns
Batch Feature Computation
Compute features in scheduled batch jobs: ingest raw data; compute lag features, rolling statistics, and calendar features; write the results to the feature store. Run daily or hourly depending on forecast frequency. Batch computation handles complex aggregations efficiently using distributed frameworks (Spark, BigQuery). Features are then served from low-latency storage.
Pipeline Structure: Raw data lake → Feature computation (Spark/SQL) → Feature store (Redis/DynamoDB) → Model serving. Each stage has monitoring, alerting, and fallback strategies for failures.
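The feature-computation stage can be sketched as follows. A minimal single-machine version using pandas, standing in for a Spark/SQL job; the input schema (one row per series_id and date with a sales column) and the specific lags and windows are illustrative assumptions, not prescribed by the pipeline.

```python
import pandas as pd

def compute_batch_features(sales: pd.DataFrame) -> pd.DataFrame:
    """Compute lag, rolling, and calendar features per series.

    Assumes a hypothetical schema: one row per (series_id, date)
    with a 'sales' column; column names are illustrative.
    """
    df = sales.sort_values(["series_id", "date"]).copy()
    g = df.groupby("series_id")["sales"]
    # Lag features: value 7 and 28 days ago (daily data assumed).
    df["lag_7"] = g.shift(7)
    df["lag_28"] = g.shift(28)
    # Rolling statistic over a trailing window, shifted by one
    # step so the current row's target never leaks into its feature.
    df["roll_mean_28"] = g.transform(lambda s: s.shift(1).rolling(28).mean())
    # Calendar features derived from the date column.
    dates = pd.to_datetime(df["date"])
    df["day_of_week"] = dates.dt.dayofweek
    df["month"] = dates.dt.month
    return df
```

The output frame would then be written to the feature store keyed by (series_id, date), with only the latest rows loaded into the low-latency serving tier.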
Streaming for Fresh Features
Some features need near-real-time updates: last-hour sales, current session activity. Streaming pipelines (Kafka, Flink) maintain running aggregations updated with each event. Trade-off: streaming is more complex and expensive than batch. Use streaming only for features where freshness provides measurable forecast improvement.
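The running-aggregation logic a streaming job maintains can be illustrated with a small in-memory sketch; a real deployment would hold this state in Flink keyed state fed by Kafka, and the class and window size here are illustrative assumptions.

```python
from collections import defaultdict, deque

class LastHourSum:
    """Minimal sketch of a 'last-hour sales' streaming feature.

    Keeps per-series state: a deque of recent events plus an
    incrementally maintained sum, updated once per event.
    """
    def __init__(self, window_seconds: int = 3600):
        self.window = window_seconds
        self.events = defaultdict(deque)   # series_id -> deque of (ts, value)
        self.sums = defaultdict(float)     # series_id -> running sum

    def update(self, series_id: str, ts: float, value: float) -> float:
        q = self.events[series_id]
        q.append((ts, value))
        self.sums[series_id] += value
        # Evict events that have fallen out of the window,
        # keeping the sum O(1) per event rather than re-scanning.
        while q and q[0][0] <= ts - self.window:
            _, old_val = q.popleft()
            self.sums[series_id] -= old_val
        return self.sums[series_id]
```

The incremental update/evict pattern is what makes streaming cheap per event; the operational cost lies in state management, checkpointing, and late-event handling, which is where the extra complexity over batch comes from.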
Consistent Training and Serving
Features computed differently in training versus serving cause training-serving skew. Mitigation: single codebase for feature computation used by both batch (training) and serving paths. Alternatively, log serving-time features and use logged values for training, guaranteeing identical features.
Best Practice: Define features declaratively (feature name, source columns, aggregation logic). Generate both batch SQL and serving code from the same definition. This eliminates implementation divergence.
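A minimal sketch of the declarative pattern: one definition drives both a generated batch SQL expression and a generated serving-time function. The FeatureDef fields, table name, and supported aggregations are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDef:
    """Hypothetical declarative feature definition."""
    name: str
    source_column: str
    agg: str            # e.g. "AVG" or "SUM"
    window_days: int

def to_batch_sql(f: FeatureDef, table: str = "sales") -> str:
    """Render the definition as a batch SQL window expression."""
    return (
        f"SELECT series_id, date, {f.agg}({f.source_column}) OVER ("
        f"PARTITION BY series_id ORDER BY date "
        f"ROWS BETWEEN {f.window_days} PRECEDING AND 1 PRECEDING"
        f") AS {f.name} FROM {table}"
    )

def to_serving_fn(f: FeatureDef):
    """Generate the equivalent serving-time computation."""
    ops = {"AVG": lambda xs: sum(xs) / len(xs), "SUM": sum}
    op = ops[f.agg]
    def compute(history: list[float]) -> float:
        # Same trailing window as the SQL: the last `window_days`
        # values strictly before the current point.
        return op(history[-f.window_days:])
    return compute
```

Because both artifacts are generated from the same FeatureDef, a change to the window or aggregation propagates to training and serving together, which is exactly the divergence this pattern eliminates.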
Backfill and Recovery
Historical feature values needed for training must be backfilled. For lag and rolling features, this requires historical raw data. Design pipelines to recompute features from raw data when logic changes. Store raw data with sufficient retention (2+ years for yearly seasonality). Recovery from pipeline failures should be idempotent—rerunning produces the same results.
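One common way to get the idempotence property is partition overwrite: each backfill run fully recomputes and atomically replaces one date partition, so reruns converge to the same state. A minimal file-based sketch; the paths, partition naming, and toy feature logic are illustrative assumptions.

```python
import json
from pathlib import Path

def backfill_partition(raw_rows: list[dict], out_dir: Path, date: str) -> Path:
    """Recompute one date partition of features from raw data.

    Idempotent: the partition is fully overwritten via an atomic
    rename, so rerunning after a failure or a logic change yields
    the same result as a single clean run.
    """
    # Toy feature logic standing in for the real computation.
    features = [
        {"series_id": r["series_id"], "date": date, "sales_x2": r["sales"] * 2}
        for r in raw_rows
    ]
    out_dir.mkdir(parents=True, exist_ok=True)
    part = out_dir / f"date={date}.json"
    # Write to a temp file, then atomically replace the partition,
    # so a crash mid-write never leaves a half-written partition.
    tmp = part.with_suffix(".tmp")
    tmp.write_text(json.dumps(features))
    tmp.replace(part)
    return part
```

The same overwrite-one-partition discipline applies when backfilling in Spark or a warehouse: target a single date partition per task and replace it wholesale, never append.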