
Feature Pipeline Architecture and Operational Patterns

A production time series feature pipeline has two parallel paths: an offline batch pipeline that backfills historical features for training, and an online streaming pipeline that maintains fresh features for real-time inference. These paths must produce identical feature values given the same input data, or training-serving skew degrades production accuracy. Achieving consistency requires shared feature definitions, strict point-in-time correctness, and comprehensive monitoring.

The offline path processes historical data to create training datasets. For a retail demand forecast with 50,000 products and 730 days of history, the pipeline reads raw sales events, groups them by product and day, and computes lag features (t-1, t-7, t-28), rolling aggregates (7-day and 28-day mean and standard deviation), and calendar features (day of week, holiday flags). Each feature is computed with a cutoff timestamp strictly before the label time to prevent leakage. The output is a table with one row per product and day, columns for the label and features, and a timestamp. This process runs daily or weekly, materializing millions of training examples. Airbnb processes 200 million booking events nightly to produce features for pricing and ranking models, completing in under 4 hours on a Spark cluster with 500 cores.
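To make the offline path concrete, the following is a minimal pandas sketch of the backfill logic, assuming a hypothetical `sales` DataFrame with `product_id`, `date`, and `units` columns (names chosen for illustration). At 50,000-product scale the same shift-and-roll logic would typically run as Spark window functions rather than in pandas.

```python
import pandas as pd

def build_offline_features(sales: pd.DataFrame) -> pd.DataFrame:
    """Backfill lag and rolling features per product with point-in-time cutoffs.

    Assumes one row per (product_id, date) with a `units` column and no date gaps;
    in production you would reindex each product onto a complete calendar first.
    """
    df = sales.sort_values(["product_id", "date"]).copy()
    grouped = df.groupby("product_id")["units"]

    # Lag features: groupby().shift() keeps each product's history separate,
    # so the feature for day t only sees that product's day t-1, t-7, t-28.
    for lag in (1, 7, 28):
        df[f"lag_{lag}"] = grouped.shift(lag)

    # Rolling stats: shift(1) inside the window so the window ends at t-1,
    # never at t -- this is the point-in-time cutoff that prevents leakage.
    for window in (7, 28):
        df[f"roll_mean_{window}"] = grouped.transform(
            lambda s, w=window: s.shift(1).rolling(w).mean()
        )
        df[f"roll_std_{window}"] = grouped.transform(
            lambda s, w=window: s.shift(1).rolling(w).std()
        )

    # Calendar features are known at prediction time, so no shift is needed.
    df["day_of_week"] = pd.to_datetime(df["date"]).dt.dayofweek
    df["label"] = df["units"]  # quantity being forecast for day t
    return df.dropna(subset=["roll_mean_28"])  # drop warm-up rows lacking full history
```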
The online path maintains fresh features in a low-latency store. It ingests events from a stream (Kafka, Kinesis), updates per-entity aggregates incrementally, and serves feature vectors on demand. For rolling aggregates, it uses keyed state with tumbling or sliding windows. When a sales event for product P arrives, the pipeline updates P's current-day bucket, recomputes the 7-day and 28-day sums, and writes the new values to a key-value store (Redis, DynamoDB, Cassandra). At inference time, the model service queries the store by product key, retrieving the feature vector in under 10 milliseconds at p95. Uber's feature store serves 100,000 queries per second across millions of entities, maintaining p99 latency under 25 milliseconds with in-memory caching and read replicas.

Shared feature definitions are critical for consistency. Each feature is declared once with entity keys, a time field, an aggregation function, a window size, and a freshness SLA. This single definition drives both the batch backfill (scan historical events, group, aggregate) and the stream updates (maintain state, apply the aggregation). Platforms like Uber Michelangelo and Airbnb Zipline provide domain-specific languages or configuration files for feature definitions: engineers declare what they want computed, and the platform generates both batch and streaming jobs automatically, eliminating manual translation errors.

Monitoring detects divergence and staleness. Track feature freshness (time since last update), null rates (fraction of missing values), and distribution drift (current values compared to historical quantiles). Run shadow jobs that recompute online features from offline data on sampled keys and measure the relative error. If more than 5 percent of features diverge by more than 10 percent, alert and investigate. Common causes include differences in late-data handling, floating-point precision, time zone bugs, and errors in incremental aggregation logic. Netflix monitors feature staleness per user, alerting if more than 1 percent of users have features older than 10 minutes, which would degrade recommendation relevance.

Storage and scale considerations shape architecture choices. For 1 million entities with 10 features each, 30 days of daily history, and 16 bytes per value, storage is roughly 5 GB before compression. Minute-level granularity over 7 days explodes to about 10 billion values per feature, so systems use tiered storage: per-minute buckets for the last 6 hours (high value, high cost), per-hour buckets for 1 to 7 days (moderate), and per-day buckets beyond that (low cost), with retention policies expiring old data. Compression (Snappy, LZ4) and columnar formats (Parquet) reduce costs further. For serving, in-memory caching (Redis, Memcached) keeps hot entities under 10 milliseconds, with fallback to slower stores (Cassandra, DynamoDB) for cold entities, accepting 50 to 100 milliseconds of latency for the long tail.
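The online path can be sketched in the same spirit. The snippet below is a simplified illustration, not a production design: a plain dictionary stands in for the key-value store (Redis, DynamoDB, or Cassandra in production), the keyed daily buckets live in process memory the way a Flink or Kafka Streams operator would hold them in keyed state, and late or out-of-order events are deliberately ignored rather than handled with watermarks.

```python
from collections import defaultdict, deque
from datetime import date

class OnlineFeatureUpdater:
    """Incrementally maintain 7-day and 28-day rolling aggregates per product.

    Keyed state holds up to 28 daily buckets per product (assuming roughly one
    bucket per calendar day); `store` stands in for a low-latency key-value store.
    """

    def __init__(self, store: dict):
        self.store = store
        # Per-product deque of (day, units_sum), most recent last, max 28 buckets.
        self.buckets = defaultdict(lambda: deque(maxlen=28))

    def on_sale(self, product_id: str, day: date, units: float) -> None:
        buckets = self.buckets[product_id]
        if buckets and buckets[-1][0] == day:
            # Same-day event: update the current day's bucket in place.
            buckets[-1] = (day, buckets[-1][1] + units)
        else:
            # New day: append a fresh bucket; deque(maxlen=28) evicts the oldest.
            buckets.append((day, units))

        # Recompute window aggregates from retained buckets and publish them.
        last7 = [v for d, v in buckets if (day - d).days < 7]
        last28 = [v for d, v in buckets if (day - d).days < 28]
        self.store[f"features:{product_id}"] = {
            "sum_7d": sum(last7),
            "mean_7d": sum(last7) / max(len(last7), 1),
            "sum_28d": sum(last28),
            "mean_28d": sum(last28) / max(len(last28), 1),
        }

# Usage: the model service reads store["features:P123"] at inference time.
store = {}
updater = OnlineFeatureUpdater(store)
updater.on_sale("P123", date(2024, 3, 1), units=4.0)
updater.on_sale("P123", date(2024, 3, 2), units=2.0)
print(store["features:P123"])  # {'sum_7d': 6.0, 'mean_7d': 3.0, 'sum_28d': 6.0, 'mean_28d': 3.0}
```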
💡 Key Takeaways
Offline batch pipelines backfill historical features for training, processing months or years of data with point-in-time joins. Airbnb processes 200 million events nightly on Spark (500 cores, 4 hours) to create training datasets for pricing and ranking models
Online streaming pipelines maintain fresh features for real-time inference using incremental aggregation in systems like Apache Flink or Kafka Streams. Uber serves 100,000 feature queries per second with p99 latency under 25 milliseconds from in-memory stores
Shared feature definitions declared once drive both batch and streaming jobs, eliminating manual translation. Platforms like Uber Michelangelo and Airbnb Zipline generate code from declarative configs, ensuring consistency and preventing training-serving skew (a sketch of such a declarative definition appears after this list)
For 1 million entities with 10 features and 30 days of daily history, storage is roughly 5 GB before compression. Minute-level granularity over 7 days requires about 10 billion values per feature, so systems use tiered storage: per-minute buckets for the last 6 hours, per-hour for 1 to 7 days, per-day beyond that
Monitor feature freshness (time since last update), null rates (missing values), and distribution drift (current vs historical quantiles). Run shadow scoring to recompute online features from offline data on sampled keys, alerting if more than 5 percent diverge by over 10 percent
Latency targets vary by feature importance. Uber allows 30 seconds for traffic features (high value) but 5 minutes for rolling averages (lower urgency), using different stream processing topologies and caching strategies to meet SLAs while controlling cost
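As referenced in the shared-definitions takeaway above, the "declare once, generate both jobs" pattern can be pictured as a small declarative record that both planners consume. The class and field names below are illustrative assumptions, not the actual Michelangelo or Zipline schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDefinition:
    """Single source of truth consumed by both the batch and streaming planners.

    Illustrative only -- not the actual Michelangelo or Zipline DSL.
    """
    name: str             # e.g. "units_mean_7d"
    entity_key: str       # e.g. "product_id"
    time_field: str       # event-time column used for point-in-time joins
    source: str           # upstream event stream or table
    aggregation: str      # "mean", "sum", "count", ...
    window_days: int      # rolling window length
    freshness_sla_s: int  # maximum staleness tolerated by online serving

UNITS_MEAN_7D = FeatureDefinition(
    name="units_mean_7d",
    entity_key="product_id",
    time_field="event_time",
    source="sales_events",
    aggregation="mean",
    window_days=7,
    freshness_sla_s=300,
)

# The platform compiles one definition into two jobs:
#   batch:  group historical events by entity_key and aggregate with a point-in-time cutoff
#   stream: keyed state + sliding window, emitting updates within freshness_sla_s
```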
📌 Examples
Amazon retail demand forecasting backfills 730 days of sales for 50,000 SKUs offline, computing lag and rolling features with point-in-time cutoffs. Online, a Flink pipeline updates 7-day rolling means within 5 minutes of new sales events, serving predictions at 3,000 QPS with p95 latency under 15 milliseconds
Netflix's recommendation pipeline materializes hourly feature snapshots offline for training, then serves real-time features from Redis (10-millisecond p95) for homepage personalization. Shadow scoring compares 1 percent of online features to offline recomputed values, alerting if Mean Absolute Error (MAE) exceeds 2 percent (a minimal sketch of such a shadow check follows this list)
Uber's ETA feature store processes 500 million trip events daily, maintaining per-road-segment rolling traffic statistics in DynamoDB with TTL expiration after 30 days. An in-memory cache (Redis) handles 80 percent of reads in under 5 milliseconds, with DynamoDB fallback at 20 milliseconds for cold segments
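The shadow comparison referenced in the Netflix example can be sketched as a simple sampled check. The function below is a hypothetical illustration that applies the 5 percent / 10 percent rule of thumb from this section to a sample of (entity, feature) keys; the key format and threshold defaults are assumptions, not a specific vendor's implementation.

```python
def shadow_check(
    online: dict[str, float],
    offline: dict[str, float],
    rel_tolerance: float = 0.10,   # per-feature divergence threshold (10 percent)
    alert_fraction: float = 0.05,  # alert if more than 5 percent of features diverge
) -> bool:
    """Compare online feature values against offline recomputation on sampled keys.

    Returns True if the pipelines have diverged enough to warrant an alert.
    Keys identify (entity, feature) pairs; values are the feature values.
    """
    diverged = 0
    compared = 0
    for key, offline_value in offline.items():
        if key not in online:
            continue  # missing keys are tracked separately as null-rate metrics
        compared += 1
        denom = max(abs(offline_value), 1e-9)  # guard against divide-by-zero
        if abs(online[key] - offline_value) / denom > rel_tolerance:
            diverged += 1
    return compared > 0 and diverged / compared > alert_fraction

# Example: 2 of 3 sampled features differ by more than 10 percent -> alert fires.
offline_sample = {"P1:roll_mean_7": 10.0, "P2:roll_mean_7": 5.0, "P3:lag_1": 3.0}
online_sample  = {"P1:roll_mean_7": 12.0, "P2:roll_mean_7": 5.1, "P3:lag_1": 4.0}
print(shadow_check(online_sample, offline_sample))  # True
```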