
State Carryover and Incremental Backfill Strategies

The Long Window Problem

Long-window aggregates like 180-day unique user counts explode backfill cost because naive recomputation scans 180 days of raw events for every training row. For a model training on 12 months of data with daily granularity (365 rows per entity), that is 365 × 180 = 65,700 partition scans per entity. At scale, this becomes prohibitively expensive.
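The scan arithmetic is easy to sanity-check directly (variable names are illustrative):

```python
window_days = 180   # lookback per training row
train_days = 365    # one row per entity per day over 12 months

# Naive backfill: every training row rescans the full window.
naive_scans = train_days * window_days

# State carryover: bootstrap the window once, then scan one new partition per day.
carryover_scans = window_days + train_days

print(naive_scans, carryover_scans)  # 65700 545
```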

State Carryover Pattern

Maintain intermediate aggregate state as of each day boundary. To compute the 180-day unique count for day N, start with the state from day N − 1, add new events from day N, and expire events that have fallen out of the window. This sliding-window approach converts O(window) work per row into amortized O(1) per row after the initial state bootstrap.
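As an illustration, a rolling unique count can carry a user → last-seen-day map across day boundaries instead of rescanning the whole window. This is a minimal in-memory sketch, not a specific feature store API:

```python
def rolling_unique_counts(daily_events, window):
    """Rolling unique-user count per day via state carryover.

    daily_events: list where daily_events[d] is the set of user ids seen on day d.
    window: lookback in days (e.g. 180).
    Each day folds in only that day's events and expires users who fell out of
    the window, instead of rescanning all `window` days of raw events.
    """
    last_seen = {}  # carried state: user id -> most recent day the user was seen
    counts = []
    for day, users in enumerate(daily_events):
        for user in users:                 # add day N's new events
            last_seen[user] = day
        cutoff = day - window              # anything at or before this left the window
        for user in [u for u, d in last_seen.items() if d <= cutoff]:
            del last_seen[user]
        counts.append(len(last_seen))      # unique users over the trailing window
    return counts
```

The eviction here scans the whole state map for simplicity; a production job would bucket users by last-seen day so expiry touches only the day leaving the window.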

Incremental Backfill

When features already exist for historical periods but need updating (schema migration, bug fix), compute only the delta rather than full recomputation. Compare new logic output against existing values, identify entities and timestamps where values differ, and update only those rows. This can reduce backfill volume by 90 to 99 percent for small corrections.
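The delta step can be sketched as a comparison of recomputed values against existing ones keyed by (entity, timestamp). In-memory dicts stand in for warehouse tables here; the function name is illustrative:

```python
def backfill_delta(existing, recomputed):
    """Return only the rows whose value changed (or is new) under the new logic.

    existing, recomputed: dict mapping (entity_id, date) -> feature value.
    Only the returned rows need to be written back, instead of rewriting
    every historical partition.
    """
    return {
        key: value
        for key, value in recomputed.items()
        if key not in existing or existing[key] != value
    }
```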

Materialized Intermediate Tables

Pre-compute and persist expensive intermediate aggregates that multiple features depend on. A daily active user table computed once supports dozens of derived features without redundant scans. Trade storage cost for compute efficiency when intermediates are reused across many backfills.
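For instance, a daily-active-user intermediate built once can back several derived features. The helper names are hypothetical, and a real pipeline would persist `dau` as a table rather than a dict:

```python
def build_dau(events):
    """Materialize day -> set of active user ids, scanning raw events once."""
    dau = {}
    for day, user in events:
        dau.setdefault(day, set()).add(user)
    return dau

# Derived features read the intermediate table instead of rescanning raw events.
def dau_count(dau, day):
    return len(dau.get(day, set()))

def trailing_unique(dau, day, window):
    """Unique users over the `window` days ending at `day`."""
    users = set()
    for d in range(day - window + 1, day + 1):
        users |= dau.get(d, set())
    return len(users)
```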

Approximation for Extreme Windows

For features with very long lookbacks (365-day aggregates) where exact computation is cost-prohibitive, consider approximate algorithms. HyperLogLog for unique counts, Count-Min Sketch for frequency estimation, and sampling for percentiles provide 95 to 99 percent accuracy at 10 to 100x lower cost.
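A compact Count-Min Sketch illustrates the trade: fixed memory regardless of cardinality, never undercounts, and overcounts only on hash collisions. This is a from-scratch sketch; production systems would typically use a library such as Apache DataSketches rather than a hand-rolled class:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in O(width * depth) memory.

    Each item increments one counter per row; the estimate is the minimum
    across rows, so it never undercounts and overcounts only when the item
    collides with others in every row.
    """
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salt the hash with the row number to get `depth` independent hashes.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def estimate(self, item):
        return min(self.rows[row][self._index(item, row)] for row in range(self.depth))
```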

Scheduling Strategy

Orchestrate backfills in dependency order: base aggregates before derived features, dimensions before facts. Use DAG orchestrators like Airflow to manage dependencies and parallelize independent branches.
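The dependency-ordering step amounts to a topological sort; Python's standard-library graphlib shows the idea without committing to any orchestrator (the task names below are hypothetical):

```python
from graphlib import TopologicalSorter

# Map each backfill task to the tasks it depends on (hypothetical names).
dependencies = {
    "dim_users": set(),                     # dimensions first
    "fact_events": {"dim_users"},           # facts after dimensions
    "daily_active_users": {"fact_events"},  # base aggregate
    "unique_180d": {"daily_active_users"},  # derived feature last
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
# Independent tasks at the same depth can be backfilled in parallel,
# which is what a DAG orchestrator like Airflow automates.
```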

💡 Key Takeaways
State carryover checkpoints rolling aggregate state at partition boundaries, reducing backfill complexity from O(history × window) to O(window + target range)
For 180-day windows over a year of training data, state carryover scans roughly 120x less data than naive recomputation (545 partition scans versus 65,700) by initializing from the prior checkpoint instead of reprocessing the full history per row
Incremental backfills recompute only changed partitions after logic updates, saving 80% to 90% of cost for localized changes but risking inconsistencies if the change's impact spans partition boundaries
Copy-on-write rewrites entire 10 gigabyte partitions even for 1% changes, simplifying time travel but increasing write cost; merge-on-read stores deltas, reducing writes but complicating reads
Uber uses stateful streaming with checkpointed windows at lake scale (hundreds of terabytes per day); Netflix uses Iceberg snapshots for atomic partition swaps
📌 Interview Tips
1. A 180-day unique user count feature over 12 months of training data (365 rows per entity) requires 65,700 partition scans per entity naively; state carryover reduces this to 180 + 365 = 545 scans
2. Feature bug fixed on February 10: an incremental backfill from February 10 to March 31 reprocesses 49 days instead of the full 365-day history, reducing cost from $4,000 to $600
3. Uber's Hudi tables use merge-on-read for continuous ingestion, deferring compaction to off-peak windows; this reduces write latency from 10 seconds (copy-on-write) to 2 seconds but increases read query time by 20%