Feature Engineering & Feature Stores • Backfilling & Historical FeaturesHard⏱️ ~3 min
State Carryover and Incremental Backfill Strategies
Long window aggregates like 180 day unique user counts explode backfill cost because naive recomputation scans 180 days of raw events for every training row. For a model training on 12 months of data with daily granularity (365 rows per entity), that is 365 × 180 = 65,700 partition scans per entity. At scale, this becomes prohibitively expensive. State carryover reduces the complexity from O(history × window) to O(window + target range).
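To make the arithmetic concrete, here is a small sketch (pure Python, figures taken from the numbers above) reproducing the scan counts:

```python
# Back-of-envelope scan counts for a 180 day rolling aggregate backfilled
# over 12 months of daily training rows (figures from the text above).
window_days = 180       # rolling window length
training_rows = 365     # one row per entity per day for 12 months

naive_scans = training_rows * window_days       # rescan the full window for every row
carryover_scans = window_days + training_rows   # seed the window once, then one new day per row

print(naive_scans)      # 65700
print(carryover_scans)  # 545 -- roughly 120x fewer partition scans per entity
```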
The technique checkpoints rolling aggregate state (counts, sums, distinct estimates) per entity at partition boundaries (typically daily). When backfilling January 1 to March 31, initialize each entity's state from the checkpoint prior to January 1 (e.g., December 31), then incrementally update through the target range. For a 7 day window, you only reprocess 7 days before the start date plus the 90 day target, instead of reprocessing all history. Uber uses this pattern with stateful streaming frameworks to maintain windows at lake scale.
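As a rough illustration, here is a minimal, framework-agnostic sketch of state carryover for a rolling distinct-user count. The `checkpoints` store and `read_partition` reader are hypothetical stand-ins for whatever the pipeline actually provides (a checkpoint table, a stateful streaming job's state backend), and production systems would typically keep a sketch such as HyperLogLog rather than exact user sets:

```python
from collections import defaultdict
from datetime import date, timedelta

def backfill_with_carryover(checkpoints, read_partition, start, end, window_days=180):
    """Backfill a rolling `window_days`-day distinct-user count from `start` to
    `end` (inclusive), seeding per-entity state from the checkpoint just before
    `start` and then applying one daily partition at a time.

    checkpoints[day] -> {entity_id: {user_id: last_seen_date}}   (hypothetical store)
    read_partition(day) -> iterable of (entity_id, user_id)      (hypothetical reader)
    """
    # Seed state from the checkpoint at the partition boundary before the range.
    seed = checkpoints[start - timedelta(days=1)]
    state = defaultdict(dict, {e: dict(users) for e, users in seed.items()})

    rows, day = [], start
    while day <= end:
        # Incremental update: only the current day's partition is scanned.
        for entity_id, user_id in read_partition(day):
            state[entity_id][user_id] = day

        cutoff = day - timedelta(days=window_days - 1)
        for entity_id, users in state.items():
            # Evict users not seen within the window, then emit the feature value.
            stale = [u for u, last_seen in users.items() if last_seen < cutoff]
            for u in stale:
                del users[u]
            rows.append((day, entity_id, len(users)))

        # Checkpoint state at the daily partition boundary for future backfills.
        checkpoints[day] = {e: dict(u) for e, u in state.items()}
        day += timedelta(days=1)
    return rows
```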
Incremental backfills recompute only changed partitions after logic updates, not full history. If you fix a bug affecting data from February 10 onward, run an incremental backfill from February 10 to present, leaving earlier partitions untouched. This reduces cost by 80% to 90% for localized changes. The risk is residual inconsistencies if change impact is not perfectly localized; for example, a window definition change at February 10 affects windows spanning before and after that date. Teams trade off savings versus correctness, often accepting approximate consistency for low impact features.
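A hedged sketch of the partition-selection logic, with `rebuild_partition` as a hypothetical hook that recomputes one day's feature partition with the corrected logic:

```python
from datetime import date, timedelta

def incremental_backfill(rebuild_partition, change_date, today):
    """Rebuild only the daily feature partitions from `change_date` through
    `today`; partitions before `change_date` are left untouched.

    If the change is not localized (e.g. a window definition change whose
    windows span the boundary), widen the start date or fall back to a full
    backfill instead of accepting the residual inconsistency."""
    day = change_date
    while day <= today:
        rebuild_partition(day)
        day += timedelta(days=1)

# e.g. a bug fix effective February 10, backfilled through the present:
# incremental_backfill(rebuild_day, date(2024, 2, 10), date(2024, 3, 31))
```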
Copy on write versus merge on read table formats present a tradeoff. Copy on write rewrites entire files or partitions on updates, simplifying time travel and atomic swaps but increasing write cost. A 10 gigabyte partition with a 1% data change still rewrites all 10 gigabytes. Merge on read stores deltas separately and reconciles on query, minimizing write cost but complicating read paths and increasing query latency during catch up windows. Netflix uses Iceberg style snapshots for atomic publishes; Uber uses Hudi merge on read for continuous ingestion workloads.
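As a toy model of the tradeoff (not any table format's actual API), the sketch below contrasts write amplification and read overhead for the two strategies; the 20% merge penalty is purely illustrative:

```python
def cow_write(partition_bytes, changed_fraction):
    """Copy on write: any change rewrites the whole partition file, so readers
    always see one compacted file but write cost ignores how little changed."""
    bytes_written = partition_bytes   # full rewrite regardless of changed_fraction
    read_overhead = 0.0               # no query-time merging needed
    return bytes_written, read_overhead

def mor_write(partition_bytes, changed_fraction, merge_penalty=0.20):
    """Merge on read: only a delta/log file is written; readers reconcile the
    base file plus deltas at query time until compaction catches up."""
    bytes_written = partition_bytes * changed_fraction   # delta only
    read_overhead = merge_penalty                        # extra query-time merge cost
    return bytes_written, read_overhead

gb = 10 * 1024**3            # a 10 gigabyte partition
print(cow_write(gb, 0.01))   # ~10 GB written, no read penalty
print(mor_write(gb, 0.01))   # ~0.1 GB written, ~20% read penalty until compaction
```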
💡 Key Takeaways
•State carryover checkpoints rolling aggregate state at partition boundaries, reducing backfill complexity from O(history × window) to O(window + target range)
•For 180 day windows, state carryover cuts per-entity scans from 65,700 to 545 (roughly 120 times less data) by initializing from a prior checkpoint instead of reprocessing full history per row
•Incremental backfills recompute only changed partitions after logic updates, saving 80% to 90% cost for localized changes but risking inconsistencies if change impact spans boundaries
•Copy on write rewrites entire 10 gigabyte partitions even for 1% changes, simplifying time travel but increasing write cost; merge on read stores deltas, reducing writes but complicating reads
•Uber uses stateful streaming with checkpointed windows at lake scale (hundreds of terabytes per day); Netflix uses Iceberg snapshots for atomic partition swaps
📌 Examples
A 180 day unique user count feature over 12 months of training data (365 rows per entity) requires 65,700 partition scans per entity naively; state carryover reduces this to 180 + 365 = 545 scans
Feature bug fixed on February 10; incremental backfill from February 10 to March 31 reprocesses 49 days instead of full 365 day history, reducing cost from $4,000 to $600
Uber Hudi tables use merge on read for continuous ingestion, deferring compaction to off peak windows; this reduces write latency from 10 seconds (copy on write) to 2 seconds but increases read query time by 20%