Rolling Statistics and Window Aggregations
Rolling statistics aggregate values over a sliding time window, such as the mean, standard deviation, or median of the last 7 days. They smooth out noise and summarize local context, transforming raw volatile signals into stable features that models can consume. A 7 day rolling mean of daily sales dampens weekend spikes and weekday dips, revealing underlying trends. A 30 minute rolling standard deviation of server latency captures volatility patterns that predict upcoming performance degradation.
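A minimal pandas sketch of these two features; the synthetic sales series and column names are illustrative assumptions, not taken from a real pipeline:

```python
import numpy as np
import pandas as pd

# Illustrative synthetic data: one store's daily sales.
idx = pd.date_range("2024-01-01", periods=60, freq="D")
sales = pd.Series(np.random.default_rng(0).poisson(200, size=60), index=idx, name="sales")

features = pd.DataFrame({
    "sales": sales,
    # 7 day rolling mean: dampens weekend spikes and weekday dips.
    "sales_mean_7d": sales.rolling("7D").mean(),
    # 7 day rolling standard deviation: summarizes local volatility.
    "sales_std_7d": sales.rolling("7D").std(),
})
print(features.tail())
```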
Window size drives the tradeoff between responsiveness and stability. Short windows like 6 hours or 1 day react quickly to changes but amplify noise. Long windows like 28 or 90 days are stable but lag behind shifts. Production systems often combine multiple windows. Airbnb's pricing model uses both a 7 day rolling median (capturing recent local demand) and a 90 day rolling median (establishing a seasonal baseline), and the model then learns to weight them based on booking velocity and calendar proximity to holidays.
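A hedged sketch of the multi-window idea, assuming a daily sales series: the short and long medians simply become two columns for a downstream model to weigh.

```python
import numpy as np
import pandas as pd

# Illustrative daily sales series (assumption, not real data).
idx = pd.date_range("2023-01-01", periods=365, freq="D")
sales = pd.Series(np.random.default_rng(1).poisson(200, size=365), index=idx)

multi_window = pd.DataFrame({
    # Short window: tracks recent local demand, reacts quickly, noisier.
    "sales_median_7d": sales.rolling("7D").median(),
    # Long window: stable seasonal baseline, lags behind sudden shifts.
    "sales_median_90d": sales.rolling("90D", min_periods=30).median(),
})
# Downstream, a model receives both columns and learns how to weight them.
```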
The key technical challenge is maintaining point in time correctness. A rolling window must not include the current timestamp when computing training features, or you leak the label into the input. If you're predicting end of day sales and compute a 7 day rolling mean at end of day, that mean includes today's sales. The fix is to compute the window strictly before the label time, for example using a cutoff at start of day or t minus epsilon. Online serving must use the same logic: when a prediction request arrives, the feature service returns aggregates over the window ending before that timestamp.
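One way to enforce this in pandas, as a sketch: shift the series by one step before rolling so the window ends strictly before the label timestamp (a time-based window with closed="left" achieves the same thing).

```python
import numpy as np
import pandas as pd

# Illustrative daily sales series (assumption).
idx = pd.date_range("2024-01-01", periods=60, freq="D")
sales = pd.Series(np.random.default_rng(2).poisson(200, size=60), index=idx)

# Leaky: the 7 day window ending at day t includes day t's own sales,
# which is exactly the label for an end-of-day prediction.
leaky_mean_7d = sales.rolling(7).mean()

# Point-in-time correct: shift by one step so the window covers t-7 .. t-1,
# i.e. the aggregate is computed strictly before the label time.
safe_mean_7d = sales.shift(1).rolling(7).mean()

# Online serving must apply the same cutoff when materializing features.
```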
At scale, incremental aggregation keeps compute tractable. For 1 million entities with minute level updates, recomputing full 7 day windows on every event is prohibitive. Instead, maintain per entity state with tumbling or sliding buckets. For a 7 day sum, keep 7 daily buckets plus the current partial bucket. When a new event arrives, add to the current bucket. When a bucket expires, subtract it from the running total. This reduces per event cost from O(window size) to O(1) and keeps memory proportional to the number of buckets, not the number of events.
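A simplified, single-entity sketch of this bucketing scheme in plain Python. The RollingDailySum class is hypothetical; a production system would keep equivalent keyed state per entity in a stream processor.

```python
from collections import deque
from datetime import datetime, timezone

class RollingDailySum:
    """Incremental rolling sum over the current partial day plus the
    previous n_days full daily buckets (assumes time-ordered events)."""

    def __init__(self, n_days: int = 7):
        self.n_days = n_days
        self.buckets = deque()   # [day_ordinal, partial_sum], oldest first
        self.total = 0.0

    def add(self, ts: datetime, value: float) -> float:
        day = ts.date().toordinal()
        # Open a new current bucket when the event belongs to a new day.
        if not self.buckets or self.buckets[-1][0] != day:
            self.buckets.append([day, 0.0])
        # O(1): add to the current partial bucket and the running total.
        self.buckets[-1][1] += value
        self.total += value
        # Expire buckets that fell out of the window; subtract them once.
        while self.buckets and self.buckets[0][0] < day - self.n_days:
            _, expired = self.buckets.popleft()
            self.total -= expired
        return self.total

# Usage: feed events in time order; memory stays proportional to bucket count.
roll = RollingDailySum()
print(roll.add(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc), 120.0))  # 120.0
print(roll.add(datetime(2024, 1, 9, 8, 0, tzinfo=timezone.utc), 80.0))   # 80.0, Jan 1 expired
```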
Real systems set different freshness targets per feature. Uber's demand forecasting updates 7 day rolling averages every 5 minutes, which is sufficient for strategic pricing, but updates 10 minute rolling surge multipliers every 30 seconds to react to sudden spikes. The pipeline uses Apache Flink with keyed state and event time processing, handling up to 100,000 events per second with p99 processing latency under 2 seconds. Storage for rolling aggregates across 10 million ride zones and driver segments is approximately 50 GB, kept in memory for sub 10 millisecond retrieval.
💡 Key Takeaways
• Rolling windows aggregate over sliding time periods like 7 days or 30 minutes, smoothing noise and capturing local trends. Short windows (hours to days) react fast but are noisy; long windows (weeks to months) are stable but lag behind changes
• Production systems combine multiple window sizes to capture different timescales. Airbnb uses 7 day and 90 day rolling medians together, letting the model learn to balance recent signals with seasonal baselines
• Point in time correctness requires computing windows strictly before the label timestamp. A 7 day mean for an end of day prediction must exclude today, or training will leak future data that won't exist at inference time
• Incremental aggregation with per entity keyed state keeps compute tractable at scale. For a 7 day sum, maintain 7 daily buckets, add new events to the current bucket, and subtract expired buckets, reducing per event cost from O(window size) to O(1)
• Freshness targets vary by use case. Uber updates 7 day averages every 5 minutes for pricing but updates 10 minute surge metrics every 30 seconds, processing 100,000 events per second with sub 2 second p99 latency
• For 1 million entities with 10 rolling features over 30 days of daily buckets, storage is approximately 5 GB compressed. Minute level buckets over 7 days require 10 billion values, so systems use dual rate features: per minute for the last 6 hours, per hour for older data
📌 Examples
Amazon demand forecasting computes 7 day and 28 day rolling mean and standard deviation of sales per SKU and store, updating every 5 minutes from a stream of 5 million daily events with p99 delay under 2 minutes
Netflix video quality predictor uses 5 minute rolling median throughput and 30 minute rolling 95th percentile latency, maintained in memory per user session, retrieved in under 8 milliseconds to drive adaptive bitrate decisions
Stripe fraud detection uses 1 hour rolling transaction count and 24 hour rolling standard deviation of amounts per merchant, computed incrementally in Apache Flink with 30 second windows and 1 minute lateness allowance