
Feature Pipeline Architecture and Operational Patterns

Batch Feature Computation

Compute features in scheduled batch jobs: ingest raw data; compute lag features, rolling statistics, and calendar features; write the results to a feature store. Run daily or hourly depending on forecast frequency. Batch computation handles complex aggregations efficiently using distributed frameworks (Spark, BigQuery). Features are then served from low-latency storage.

Pipeline Structure: Raw data lake → Feature computation (Spark/SQL) → Feature store (Redis/DynamoDB) → Model serving. Each stage has monitoring, alerting, and fallback strategies for failures.
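The feature-computation stage above can be sketched in pandas. This is a minimal, illustrative version: the column names (`series_id`, `date`, `y`) and the specific lags and window lengths are assumptions, not part of the source; a production job would express the same logic in Spark or SQL.

```python
import pandas as pd

def compute_batch_features(sales: pd.DataFrame) -> pd.DataFrame:
    """Compute lag, rolling, and calendar features per series.

    Expects columns: series_id, date (daily datetime), y (target).
    Column names and windows are illustrative.
    """
    df = sales.sort_values(["series_id", "date"]).copy()
    # Lag features: groupby-shift keeps each series' values separate
    df["lag_1"] = df.groupby("series_id")["y"].shift(1)   # yesterday
    df["lag_7"] = df.groupby("series_id")["y"].shift(7)   # same weekday last week
    # Rolling mean over the *shifted* series so the current value never leaks in
    df["roll_mean_7"] = df.groupby("series_id")["y"].transform(
        lambda s: s.shift(1).rolling(7).mean()
    )
    # Calendar features derived from the timestamp
    df["dow"] = df["date"].dt.dayofweek
    df["month"] = df["date"].dt.month
    return df
```

Note the shift before the rolling window: without it, the rolling statistic for day *t* would include day *t* itself, a value unavailable at forecast time.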

Streaming for Fresh Features

Some features need near-real-time updates: last-hour sales, current session activity. Streaming pipelines (Kafka, Flink) maintain running aggregations updated with each event. Trade-off: streaming is more complex and expensive than batch. Use streaming only for features where freshness provides measurable forecast improvement.
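The running aggregation a streaming job maintains can be illustrated with a pure-Python stand-in for the keyed state a Flink or Kafka Streams operator would hold. The class name, window size, and in-order event-time assumption are all illustrative, not from the source.

```python
from collections import deque

class LastHourSum:
    """Running 'last-hour sales' aggregate per key: a stand-in for the
    keyed state a Flink/Kafka Streams operator maintains. Assumes events
    arrive in event-time order; a real job would also handle lateness."""

    WINDOW = 3600  # window length in seconds (illustrative)

    def __init__(self):
        self.events = {}  # key -> deque of (event_time, value)
        self.totals = {}  # key -> current windowed sum

    def update(self, key, event_time, value):
        q = self.events.setdefault(key, deque())
        q.append((event_time, value))
        self.totals[key] = self.totals.get(key, 0) + value
        # Evict events that have fallen out of the window
        while q and q[0][0] <= event_time - self.WINDOW:
            _, old = q.popleft()
            self.totals[key] -= old
        return self.totals[key]
```

Each incoming event updates the total in O(1) amortized time, which is why streaming aggregation scales to high event rates where recomputing from raw data per request would not.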

Consistent Training and Serving

Features computed differently in training versus serving cause training-serving skew. Mitigation: single codebase for feature computation used by both batch (training) and serving paths. Alternatively, log serving-time features and use logged values for training, guaranteeing identical features.

Best Practice: Define features declaratively (feature name, source columns, aggregation logic). Generate both batch SQL and serving code from the same definition. This eliminates implementation divergence.
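A minimal sketch of the declarative pattern: one feature definition from which both the batch SQL and the serving-path computation are generated. The dataclass fields, table name, and SQL dialect are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDef:
    """Declarative feature spec: name, source column, aggregation, window."""
    name: str
    source_column: str
    agg: str          # "sum" or "avg" (illustrative subset)
    window_days: int

def to_batch_sql(f: FeatureDef, table: str = "sales") -> str:
    """Render the batch (training) SQL for one feature definition."""
    return (
        f"SELECT series_id, {f.agg.upper()}({f.source_column}) AS {f.name} "
        f"FROM {table} WHERE date >= CURRENT_DATE - {f.window_days} "
        f"GROUP BY series_id"
    )

def serve(f: FeatureDef, rows: list) -> float:
    """Compute the same feature online from in-memory row values."""
    total = sum(rows)
    return total if f.agg == "sum" else total / len(rows)
```

Because both paths read the same `FeatureDef`, a change to the aggregation or window propagates to training and serving together, which is what eliminates implementation divergence.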

Backfill and Recovery

Historical feature values needed for training must be backfilled. For lag and rolling features, this requires historical raw data. Design pipelines to recompute features from raw data when logic changes. Store raw data with sufficient retention (2+ years to capture yearly seasonality). Recovery from pipeline failures should be idempotent—rerunning produces the same results.
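The idempotency requirement can be sketched as a backfill loop that recomputes each date partition from raw data and overwrites it, so a rerun converges to the same state. The function names and dict-based "store" are illustrative stand-ins for a real feature store and compute job.

```python
import datetime as dt

def backfill_features(raw: dict, start: dt.date, end: dt.date, compute, store: dict) -> dict:
    """Idempotent backfill: recompute each date partition from raw data
    and overwrite it in the store. `raw` maps date -> raw rows, `compute`
    is a pure feature function, `store` is a dict-like feature store.
    All names are illustrative."""
    day = start
    while day <= end:
        rows = raw.get(day, [])
        # Overwrite the partition rather than appending: rerunning the
        # same range produces identical state, never duplicates.
        store[day] = compute(rows)
        day += dt.timedelta(days=1)
    return store
```

Partition-level overwrite is the key design choice: append-style writes would double-count on retry, while overwrite makes recovery as simple as rerunning the failed range.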

💡 Key Takeaways

- Batch pipeline: raw data → feature computation (Spark) → feature store (Redis) → model serving
- Use streaming only where freshness measurably improves forecasts—it is more complex and expensive
- Define features declaratively and generate both batch and serving code from the same definition
📌 Interview Tips

1. Log serving-time features and use the logged values for training to guarantee identical features
2. Store raw data with 2+ years of retention for yearly-seasonality backfills; ensure idempotent recovery