Feature Engineering & Feature Stores • Backfilling & Historical FeaturesHard⏱️ ~3 min
State Carryover and Incremental Backfill Strategies
Long window aggregates like 180 day unique user counts explode backfill cost because naive recomputation scans 180 days of raw events for every training row. For a model training on 12 months of data with daily granularity (365 rows per entity), that is 365 × 180 = 65,700 partition scans per entity. At scale, this becomes prohibitively expensive. State carryover reduces the complexity from O(history × window) to O(window + target range).
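To make the arithmetic concrete, here is a small sketch (pure Python, figures taken from the numbers above) reproducing the scan counts:

```python
# Back-of-envelope scan counts for a 180 day rolling aggregate backfilled
# over 12 months of daily training rows (figures from the text above).
window_days = 180       # rolling window length
training_rows = 365     # one row per entity per day for 12 months

naive_scans = training_rows * window_days       # rescan the full window for every row
carryover_scans = window_days + training_rows   # seed the window once, then one new day per row

print(naive_scans)      # 65700
print(carryover_scans)  # 545 -- roughly 120x fewer partition scans per entity
```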
The technique checkpoints rolling aggregate state (counts, sums, distinct estimates) per entity at partition boundaries (typically daily). When backfilling January 1 to March 31, initialize each entity's state from the checkpoint prior to January 1 (e.g., December 31), then incrementally update through the target range. For a 7 day window, you only reprocess 7 days before the start date plus the 90 day target, instead of reprocessing all history. Uber uses this pattern with stateful streaming frameworks to maintain windows at lake scale.
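As a rough illustration, here is a minimal, framework-agnostic sketch of state carryover for a rolling distinct-user count. The `checkpoints` store and `read_partition` reader are hypothetical stand-ins for whatever the pipeline actually provides (a checkpoint table, a stateful streaming job's state backend), and production systems would typically keep a sketch such as HyperLogLog rather than exact user sets:

```python
from collections import defaultdict
from datetime import date, timedelta

def backfill_with_carryover(checkpoints, read_partition, start, end, window_days=180):
    """Backfill a rolling `window_days`-day distinct-user count from `start` to
    `end` (inclusive), seeding per-entity state from the checkpoint just before
    `start` and then applying one daily partition at a time.

    checkpoints[day] -> {entity_id: {user_id: last_seen_date}}   (hypothetical store)
    read_partition(day) -> iterable of (entity_id, user_id)      (hypothetical reader)
    """
    # Seed state from the checkpoint at the partition boundary before the range.
    seed = checkpoints[start - timedelta(days=1)]
    state = defaultdict(dict, {e: dict(users) for e, users in seed.items()})

    rows, day = [], start
    while day <= end:
        # Incremental update: only the current day's partition is scanned.
        for entity_id, user_id in read_partition(day):
            state[entity_id][user_id] = day

        cutoff = day - timedelta(days=window_days - 1)
        for entity_id, users in state.items():
            # Evict users not seen within the window, then emit the feature value.
            stale = [u for u, last_seen in users.items() if last_seen < cutoff]
            for u in stale:
                del users[u]
            rows.append((day, entity_id, len(users)))

        # Checkpoint state at the daily partition boundary for future backfills.
        checkpoints[day] = {e: dict(u) for e, u in state.items()}
        day += timedelta(days=1)
    return rows
```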
Incremental backfills recompute only changed partitions after logic updates, not full history. If you fix a bug affecting data from February 10 onward, run an incremental backfill from February 10 to present, leaving earlier partitions untouched. This reduces cost by 80% to 90% for localized changes. The risk is residual inconsistencies if change impact is not perfectly localized; for example, a window definition change at February 10 affects windows spanning before and after that date. Teams trade off savings versus correctness, often accepting approximate consistency for low impact features.
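A hedged sketch of the partition-selection logic, with `rebuild_partition` as a hypothetical hook that recomputes one day's feature partition with the corrected logic:

```python
from datetime import date, timedelta

def incremental_backfill(rebuild_partition, change_date, today):
    """Rebuild only the daily feature partitions from `change_date` through
    `today`; partitions before `change_date` are left untouched.

    If the change is not localized (e.g. a window definition change whose
    windows span the boundary), widen the start date or fall back to a full
    backfill instead of accepting the residual inconsistency."""
    day = change_date
    while day <= today:
        rebuild_partition(day)
        day += timedelta(days=1)

# e.g. a bug fix effective February 10, backfilled through the present:
# incremental_backfill(rebuild_day, date(2024, 2, 10), date(2024, 3, 31))
```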
Copy on write versus merge on read table formats present a tradeoff. Copy on write rewrites entire files or partitions on updates, simplifying time travel and atomic swaps but increasing write cost. A 10 gigabyte partition with a 1% data change still rewrites all 10 gigabytes. Merge on read stores deltas separately and reconciles on query, minimizing write cost but complicating read paths and increasing query latency during catch up windows. Netflix uses Iceberg style snapshots for atomic publishes; Uber uses Hudi merge on read for continuous ingestion workloads.
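As a toy model of the tradeoff (not any table format's actual API), the sketch below contrasts write amplification and read overhead for the two strategies; the 20% merge penalty is purely illustrative:

```python
def cow_write(partition_bytes, changed_fraction):
    """Copy on write: any change rewrites the whole partition file, so readers
    always see one compacted file but write cost ignores how little changed."""
    bytes_written = partition_bytes   # full rewrite regardless of changed_fraction
    read_overhead = 0.0               # no query-time merging needed
    return bytes_written, read_overhead

def mor_write(partition_bytes, changed_fraction, merge_penalty=0.20):
    """Merge on read: only a delta/log file is written; readers reconcile the
    base file plus deltas at query time until compaction catches up."""
    bytes_written = partition_bytes * changed_fraction   # delta only
    read_overhead = merge_penalty                        # extra query-time merge cost
    return bytes_written, read_overhead

gb = 10 * 1024**3            # a 10 gigabyte partition
print(cow_write(gb, 0.01))   # ~10 GB written, no read penalty
print(mor_write(gb, 0.01))   # ~0.1 GB written, ~20% read penalty until compaction
```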
💡 Key Takeaways
•State carryover checkpoints rolling aggregate state at partition boundaries, reducing backfill complexity from O(history × window) to O(window + target range)
•For 180 day windows, state carryover cuts per-entity scans from 65,700 to 545 (roughly 120 times less data) by initializing from a prior checkpoint instead of reprocessing full history per row
•Incremental backfills recompute only changed partitions after logic updates, saving 80% to 90% cost for localized changes but risking inconsistencies if change impact spans boundaries
•Copy on write rewrites entire 10 gigabyte partitions even for 1% changes, simplifying time travel but increasing write cost; merge on read stores deltas, reducing writes but complicating reads
•Uber uses stateful streaming with checkpointed windows at lake scale (hundreds of terabytes per day); Netflix uses Iceberg snapshots for atomic partition swaps
📌 Examples
A 180 day unique user count feature over 12 months of training data (365 rows per entity) requires 65,700 partition scans per entity naively; state carryover reduces this to 180 + 365 = 545 scans
Feature bug fixed on February 10; incremental backfill from February 10 to March 31 reprocesses 49 days instead of full 365 day history, reducing cost from $4,000 to $600
Uber Hudi tables use merge on read for continuous ingestion, deferring compaction to off peak windows; this reduces write latency from 10 seconds (copy on write) to 2 seconds but increases read query time by 20%