Feature Engineering & Feature Stores › Backfilling & Historical Features (Hard, ~3 min)

Idempotency and Atomic Publication Patterns

Why Idempotency Matters

Idempotency guarantees that rerunning a backfill job produces identical results, with no duplicates or inconsistencies. This matters because large backfills routinely fail mid-run due to transient errors, stragglers, or resource limits, and must be restartable without corrupting previously written partitions.

Deterministic Upserts

The foundation is deterministic upserts keyed by entity ID, feature name, and timestamp. Instead of appending rows, each write overwrites any existing row with the same key. Running the job twice produces identical output. Delta Lake and Hudi provide MERGE operations that implement this pattern efficiently.
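A minimal sketch of this overwrite-by-key semantic, using an in-memory dict as the table; Delta Lake and Hudi MERGE implement the same pattern at scale, and all names here are illustrative:

```python
# Deterministic upsert: each write replaces any existing row with the same
# (entity_id, feature_name, timestamp) key instead of appending a new row.

def upsert(table: dict, rows: list) -> None:
    """Overwrite rows by composite key so reruns produce identical output."""
    for row in rows:
        key = (row["entity_id"], row["feature_name"], row["timestamp"])
        table[key] = row["value"]

features = {}
batch = [{"entity_id": "u1", "feature_name": "spend_7d",
          "timestamp": "2024-01-01T00:00:00Z", "value": 42.0}]
upsert(features, batch)
upsert(features, batch)  # rerunning the job changes nothing
assert len(features) == 1
```

Because the key fully determines where a row lands, a restarted job that reprocesses an already-written range simply rewrites the same values.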

Partition-Level Atomicity

Write backfill output to staging partitions, validate completeness and correctness, then atomically swap staging into production. This prevents partial writes from being visible to downstream consumers. If validation fails, discard staging and retry without affecting production tables.
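A sketch of the staging-then-swap step, modeling a partition as a file and the swap as an atomic rename; `os.replace` is atomic within one filesystem, and the paths and validation hook are illustrative:

```python
# Write to a staging path, validate, then atomically rename into production.
# A failed validation discards staging and leaves production untouched.
import json
import os

def publish_partition(rows, prod_path, validate):
    staging_path = prod_path + ".staging"
    with open(staging_path, "w") as f:
        json.dump(rows, f)
    if not validate(rows):
        os.remove(staging_path)          # discard staging; production unchanged
        raise ValueError("validation failed; production partition unchanged")
    os.replace(staging_path, prod_path)  # atomic swap into production
```

Consumers reading `prod_path` never observe a half-written partition: they see either the old version or the fully validated new one.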

Checkpointing Strategy

For multi-hour backfills, checkpoint progress at partition boundaries. If a job fails after completing partitions 1 through 50, the restart should skip those partitions and resume from partition 51. Store checkpoint state in a durable metadata store separate from the output tables.
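The restart logic can be sketched as follows, with the checkpoint store modeled as a set (in practice a durable metadata store separate from the output tables); names are assumptions:

```python
# Skip partitions recorded as complete; checkpoint only after a partition
# commits, so a crash between process() and add() at worst causes a rerun
# of one partition, which the idempotent upserts make safe.

def run_backfill(partitions, checkpoints, process):
    for partition in partitions:
        if partition in checkpoints:
            continue                  # finished in a previous run; skip
        process(partition)
        checkpoints.add(partition)    # record only after the partition commits

# Restart after a failure that had completed partitions 1 and 2:
done = {1, 2}
processed = []
run_backfill([1, 2, 3, 4], done, processed.append)
assert processed == [3, 4]
```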

Conflict Resolution

When multiple backfill jobs write to overlapping partitions (rare but possible during migrations), define deterministic conflict resolution: latest writer wins, highest version wins, or fail loudly. Ambiguous merge semantics cause silent data corruption that surfaces months later as unexplained model degradation.
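A highest-version-wins policy with a loud failure on ties might look like this sketch; the row schema is an assumption:

```python
# Deterministic conflict resolution between two writers of the same
# (entity, feature, timestamp) row: highest version wins, and equal
# versions raise rather than merge ambiguously.

def resolve(existing, incoming):
    if incoming["version"] > existing["version"]:
        return incoming
    if incoming["version"] < existing["version"]:
        return existing
    raise ValueError("ambiguous merge: equal versions for the same key")

old = {"value": 1.0, "version": 3}
new = {"value": 2.0, "version": 5}
assert resolve(old, new)["value"] == 2.0
```

The key property is that the outcome depends only on the rows themselves, never on which job happened to write last.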

Audit Trail

Log backfill job metadata: start time, end time, input data ranges, output partition counts, row counts, and checksums. This audit trail enables debugging when downstream consumers report data quality issues.
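A sketch of such an audit record, where a checksum over canonically serialized rows lets two runs be compared byte-for-byte; the field names are illustrative, not a standard schema, and rows are assumed to arrive in a deterministic order:

```python
# Build an audit record for one backfill run: identifying metadata plus a
# content checksum that makes reruns directly comparable.
import hashlib
import json

def audit_record(job_id, input_range, rows):
    payload = json.dumps(rows, sort_keys=True).encode()
    return {
        "job_id": job_id,
        "input_range": input_range,      # e.g. ("2024-01-01", "2024-01-31")
        "row_count": len(rows),
        "checksum": hashlib.sha256(payload).hexdigest(),
    }
```

If a downstream consumer reports bad data, matching checksums across reruns quickly rule the backfill logic in or out as the culprit.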

💡 Key Takeaways
- Idempotency requires deterministic upserts keyed by entity ID, feature name, timestamp, and a monotonic version; tie-breakers use max version or latest ingestion LSN to deduplicate
- Atomic publication through shadow tables and pointer swaps prevents partial results from being visible; consumers see consistent snapshots even if a backfill fails mid-run
- Validation compares offline backfilled values against online computed values for sampled entities, targeting greater than 99.9% exact match for deterministic features
- Distributional validation using KL divergence or PSI detects silent logic errors that shift feature distributions by 5% or more without throwing exceptions
- Uber Michelangelo uses copy-on-write or merge-on-read with primary-key upserts; Netflix uses Iceberg-style snapshot isolation enabling instant rollback by reverting pointers
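The distributional validation mentioned above can be sketched with PSI over binned feature proportions; the 0.05 threshold and epsilon smoothing are illustrative choices, not fixed standards:

```python
# Population Stability Index (PSI): compares per-bin proportions of the
# backfilled feature against a baseline; a large value blocks publication.
import math

def psi(expected, actual, eps=1e-6):
    """expected/actual: per-bin proportions, each summing to 1."""
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

baseline = [0.25, 0.25, 0.25, 0.25]
shifted  = [0.40, 0.20, 0.20, 0.20]   # mass shifted into one bin
assert psi(baseline, baseline) < 1e-9  # identical distributions pass
assert psi(baseline, shifted) > 0.05   # exceeds threshold: block publication
```

The point is that a logic bug which shifts the distribution without raising any exception still trips this check.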
📌 Interview Tips
1. A backfill job fails at hour 7 of 10; on restart, previously written partitions are skipped or overwritten deterministically by entity ID and timestamp, preventing duplicate feature rows
2. A shadow backfill writes to features_shadow_run_xyz; after validating 99.96% parity on 50,000 sampled entities, the production pointer is atomically swapped to the new snapshot in a single metadata transaction
3. A feature-logic bug causes an 8% shift in the value distribution; a KL divergence of 0.15 (threshold 0.05) fails validation, blocking publication and preventing silent model degradation