State Carryover and Incremental Backfill Strategies
The Long Window Problem
Long-window aggregates such as 180-day unique user counts explode backfill cost because naive recomputation scans 180 days of raw events for every training row. For a model training on 12 months of data at daily granularity (365 rows per entity), that is 365 × 180 = 65,700 partition scans per entity. At scale, this becomes prohibitively expensive.
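The arithmetic above, as a back-of-envelope cost model (variable names are illustrative); the carryover figure anticipates the O(1)-per-row approach described below:

```python
# Hypothetical cost model for one entity, counting partition scans.
rows_per_entity = 365   # one training row per day for 12 months
window_days = 180       # lookback of the aggregate

# Naive: rescan the full window for every training row.
naive_scans = rows_per_entity * window_days        # 365 * 180 = 65,700

# State carryover: scan the window once to bootstrap, then one day per row.
carryover_scans = window_days + rows_per_entity    # 180 + 365 = 545
```

Even in this rough model the carryover approach is two orders of magnitude cheaper per entity.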
State Carryover Pattern
Maintain intermediate aggregate state as of each day boundary. To compute the 180-day unique count for day N, start with the state from day N−1, add new events from day N, and expire events from day N−180, which fall out of the 180-day window ending on day N. This sliding-window approach converts O(window size) work per row into O(1) per row after the initial state bootstrap.
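A minimal sketch of the carryover pattern, assuming exact per-user event counts are retained so expiry is possible; the function and container names are illustrative, not a specific library's API:

```python
from collections import Counter

WINDOW = 180  # days

def update_state(state, daily_events, new_day_events, day, window=WINDOW):
    """Advance the sliding-window unique count by one day.

    state:        Counter mapping user_id -> events inside the window.
    daily_events: dict mapping day -> list of user_ids seen that day,
                  retained so expired days can be subtracted later.
    Returns the unique-user count for the window ending on `day`.
    """
    # Add the new day's events to the running state.
    daily_events[day] = new_day_events
    state.update(new_day_events)

    # Expire the day that just fell out of the window.
    expired_day = day - window
    for user in daily_events.pop(expired_day, []):
        state[user] -= 1
        if state[user] == 0:
            del state[user]

    return len(state)
```

Each call touches only two days of events, regardless of window length, which is the O(1)-per-row property described above.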
Incremental Backfill
When features already exist for historical periods but need updating (after a schema migration or bug fix), compute only the delta rather than recomputing everything. Compare the new logic's output against existing values, identify the entities and timestamps where values differ, and update only those rows. For small corrections this can reduce backfill volume by 90 to 99 percent.
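A dictionary-based sketch of the delta comparison, assuming feature rows can be keyed by (entity, date); the names here are hypothetical:

```python
def delta_backfill(existing, recomputed):
    """Return only the (entity_id, date) rows whose value changed.

    existing / recomputed: dict mapping (entity_id, date) -> feature value.
    New keys in `recomputed` count as changes too.
    """
    updates = {}
    for key, new_value in recomputed.items():
        if existing.get(key) != new_value:
            updates[key] = new_value
    return updates
```

Only the rows in `updates` are written back; unchanged rows are never rewritten, which is where the volume reduction comes from.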
Materialized Intermediate Tables
Pre-compute and persist expensive intermediate aggregates that multiple features depend on. A daily active user table computed once supports dozens of derived features without redundant scans. Trade storage cost for compute efficiency when intermediates are reused across many backfills.
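A toy illustration of the reuse pattern, assuming raw events are (day, user_id) pairs: the daily active user table is built once, and two hypothetical derived features read from it instead of rescanning raw events:

```python
from collections import defaultdict

def build_dau_table(events):
    """Materialize a daily-active-users table from raw (day, user_id) events."""
    users_by_day = defaultdict(set)
    for day, user in events:
        users_by_day[day].add(user)
    return {day: len(users) for day, users in users_by_day.items()}

# Derived features that reuse the materialized table.
def avg_dau(dau_table, day, window=7):
    """Mean daily active users over the window ending on `day`."""
    return sum(dau_table.get(d, 0) for d in range(day - window + 1, day + 1)) / window

def max_dau(dau_table, day, window=7):
    """Peak daily active users over the window ending on `day`."""
    return max(dau_table.get(d, 0) for d in range(day - window + 1, day + 1))
```

Both derived features run against the small materialized table; the expensive scan over raw events happens once, however many features consume it.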
Approximation for Extreme Windows
For features with very long lookbacks (e.g. 365-day aggregates) where exact computation is cost-prohibitive, consider approximate algorithms. HyperLogLog for unique counts, Count-Min Sketch for frequency estimation, and sampling for percentiles provide 95 to 99 percent accuracy at 10 to 100x lower cost.
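As an illustration of the accuracy/cost trade, here is a simplified HyperLogLog sketch. This variant ranks by trailing zero bits of the hash (probabilistically equivalent to the usual leading-zero formulation) and includes the standard small-range correction; it is a teaching sketch, not a production implementation:

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct counting in m registers instead of a full set."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, m >= 128

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)               # low p bits pick a register
        w = h >> self.p                      # remaining bits feed the rank
        rank = 1                             # 1 + trailing zeros of w
        while w & 1 == 0 and rank <= 160 - self.p:
            rank += 1
            w >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:              # small-range (linear counting) correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return int(est)
```

With p=12 (4,096 one-byte registers) the sketch uses a few kilobytes of state regardless of cardinality, which is the source of the cost reduction for very long windows.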
Scheduling Strategy
Orchestrate backfills in dependency order: base aggregates before derived features, dimensions before facts. Use DAG orchestrators like Airflow to manage dependencies and parallelize independent branches.
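The ordering itself can be sketched with Python's standard-library graphlib; the dependency graph here is hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: node -> set of upstream tables it reads.
deps = {
    "raw_events": set(),
    "daily_active_users": {"raw_events"},
    "dau_7d_avg": {"daily_active_users"},
    "dau_180d_unique": {"daily_active_users"},
}

# static_order yields every node after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
```

In a real orchestrator, independent siblings such as dau_7d_avg and dau_180d_unique would be scheduled as parallel branches once daily_active_users has landed.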