
Idempotency and Atomic Publication Patterns

Idempotency guarantees that rerunning a backfill job produces identical results, with no duplicates or inconsistencies. This is critical because large backfills fail mid-run due to transient errors, stragglers, or resource limits, and must be restartable without corrupting previously written partitions. The foundation is deterministic upserts keyed by entity ID, feature name, feature timestamp, and a monotonic feature version or run ID. When duplicates occur (from retries or replays), the system must apply a deterministic tie-breaker; common strategies select the row with the maximum version number or the latest ingestion log sequence number (LSN). Without this, multiple feature values can exist for the same entity and timestamp, producing non-deterministic training data that shifts model metrics from run to run. Uber's Michelangelo uses copy-on-write or merge-on-read table patterns with primary-key upserts by entity ID and timestamp to enforce uniqueness.

Atomic publication prevents partial results from becoming visible to consumers. Writing directly to production tables risks exposing a mix of old and new values if a job fails mid-run; a training pipeline reading during this window sees inconsistent state. Instead, write to shadow outputs tagged with a unique run ID, then atomically swap a pointer (a snapshot identifier in Iceberg-style tables, or a partition directory rename). Netflix uses snapshot isolation with atomic pointer swaps to publish backfilled data; if validation fails, the shadow snapshot is simply discarded without affecting production.

Validation before publish is non-negotiable. Compare offline backfilled values against online computed values for a random sample of entities and timestamps, targeting exact-match rates above 99.9% for deterministic features; for approximate features such as HyperLogLog distinct counts, bound relative error under 1%. Additionally, validate distributional stability using Kullback-Leibler (KL) divergence or the Population Stability Index (PSI) between pre-backfill and post-backfill outputs, to catch silent logic errors that shift distributions without throwing exceptions. The sketches below illustrate each of these steps in turn.
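To make the tie-breaking concrete, here is a minimal dedup sketch, assuming PySpark; the input path and the column names entity_id, feature_name, feature_ts, feature_version, and ingest_lsn are hypothetical:

```python
# Minimal idempotent-dedup sketch; paths and column names are hypothetical.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("idempotent-backfill").getOrCreate()

# Backfill output: after retries or replays, the same
# (entity_id, feature_name, feature_ts) key may appear more than once.
raw = spark.read.parquet("s3://bucket/backfill/run_xyz/")

# Deterministic tie-breaker: keep the row with the highest feature_version,
# falling back to the highest ingestion LSN when versions tie.
w = (Window.partitionBy("entity_id", "feature_name", "feature_ts")
           .orderBy(F.col("feature_version").desc(), F.col("ingest_lsn").desc()))

deduped = (raw.withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") == 1)
              .drop("rn"))

# Rerunning the job regenerates identical rows for the same keys, so a
# restart can safely overwrite previously written partitions.
deduped.write.mode("overwrite").parquet("s3://bucket/backfill/run_xyz_deduped/")
```

Table formats such as Hudi or Delta express the same idea natively as a primary-key MERGE/upsert, which is the mechanism copy-on-write and merge-on-read tables provide.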
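Next, a minimal sketch of shadow-write-then-swap, assuming a simple pointer-file scheme in which consumers resolve the published snapshot by reading a pointer file; real deployments would typically commit an Iceberg or Delta snapshot instead, and the paths and validation stub here are hypothetical:

```python
# Shadow publication sketch; paths and the validation stub are hypothetical.
import os
import shutil
import tempfile

POINTER = "/warehouse/features/CURRENT"  # pointer file read by consumers
SHADOW = "/warehouse/features/run_xyz"   # shadow output of this backfill run

def validation_passed(snapshot_dir: str) -> bool:
    # Stand-in for the parity and distribution checks sketched below.
    return True

def publish(pointer_path: str, snapshot_dir: str) -> None:
    """Atomically repoint consumers at snapshot_dir.

    os.replace is an atomic rename on POSIX filesystems, so a reader sees
    either the old pointer or the new one, never a partially written mix.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(pointer_path))
    with os.fdopen(fd, "w") as f:
        f.write(snapshot_dir)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, pointer_path)

if validation_passed(SHADOW):
    publish(POINTER, SHADOW)   # single atomic metadata swap
else:
    shutil.rmtree(SHADOW)      # discard shadow snapshot; production untouched
```

Rollback is the same operation in reverse: rewrite the pointer to the previous snapshot, which is what makes snapshot-based publication instantly revertible.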
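A sketch of the offline/online parity check follows; fetch_offline and fetch_online are hypothetical stand-ins for the offline table reader and the online store client:

```python
# Parity validation sketch; the fetch callables are hypothetical stand-ins.
import random

def parity_check(entity_ids, fetch_offline, fetch_online,
                 sample_size=50_000, threshold=0.999):
    """Exact-match rate over a random sample of entities.

    Appropriate for deterministic features; approximate features (e.g.
    HyperLogLog distinct counts) should instead be compared under a
    relative-error bound such as 1%.
    """
    sample = random.sample(list(entity_ids), min(sample_size, len(entity_ids)))
    matches = sum(1 for eid in sample
                  if fetch_offline(eid) == fetch_online(eid))
    rate = matches / len(sample)
    return rate >= threshold, rate
```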
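Finally, a sketch of the distributional gate using PSI computed over shared histogram bins (a KL divergence gate works the same way); the bin count, the synthetic data, and the 0.05 threshold are illustrative assumptions, not universal constants:

```python
# Distribution-shift gate sketch; bin count and threshold are assumptions.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between pre- and post-backfill samples."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) in sparse bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

pre = np.random.normal(0.0, 1.0, 100_000)   # pre-backfill feature sample
post = np.random.normal(0.5, 1.0, 100_000)  # post-backfill, shifted by a bug
if psi(pre, post) > 0.05:                   # assumed gating threshold
    raise SystemExit("Distribution shift detected; blocking publication")
```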
💡 Key Takeaways
Idempotency requires deterministic upserts keyed by entity ID, feature name, timestamp, and a monotonic version; tie-breakers use the max version or latest ingestion LSN to deduplicate
Atomic publication through shadow tables and pointer swaps prevents partial results from being visible; consumers see consistent snapshots even if a backfill fails mid-run
Validation compares offline backfilled values against online computed values for sampled entities, targeting greater than 99.9% exact match for deterministic features
Distributional validation using KL divergence or PSI detects silent logic errors that shift feature distributions by 5% or more without throwing exceptions
Uber's Michelangelo uses copy-on-write or merge-on-read with primary-key upserts; Netflix uses Iceberg-style snapshot isolation enabling instant rollback by reverting pointers
📌 Examples
A backfill job fails at hour 7 of 10; on restart, previously written partitions are skipped or overwritten deterministically by entity ID and timestamp, preventing duplicate feature rows
Shadow backfill writes to features_shadow_run_xyz; after validating 99.96% parity on 50,000 sampled entities, production pointer is atomically swapped to the new snapshot in a single metadata transaction
A feature logic bug causes 8% shift in value distribution; KL divergence of 0.15 (threshold 0.05) fails validation, blocking publication and preventing silent model degradation