Exactly-Once Semantics: Idempotency, Checkpoints, and Sink Guarantees
Exactly-once end-to-end semantics ensure that each event affects final feature values exactly once, even across failures and retries. This requires three components: stateful operator checkpointing, idempotent or transactional sinks, and coordinated recovery. Without exactly-once guarantees, failures cause duplicate or missing feature updates. A rolling count feature might double-count events after recovery (at-least-once) or miss events (at-most-once), producing wrong training data and degrading model accuracy.
Checkpoints snapshot all operator state and event stream positions to durable storage every 10 to 60 seconds. When a task fails, the pipeline restarts from the last successful checkpoint, replaying events from that point. For exactly-once, the checkpoint must be a consistent global snapshot: all operators checkpoint at the same logical stream position, coordinated by barriers injected into the stream. With 500 GB of state and a 30-second checkpoint interval, a full checkpoint must write 500 GB within 30 seconds, roughly 17 GB per second of aggregate storage bandwidth, or about 130 MB per second per task at a parallelism of 128. Incremental checkpoints reduce this by writing only changed state.
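To make the mechanics concrete, here is a minimal runnable Python sketch of checkpoint-and-replay recovery. An in-memory list with offsets stands in for the Kafka partition, a dict copy stands in for durable checkpoint storage, and all names (CountingOperator and so on) are hypothetical, not a real framework API:

```python
import copy

class CountingOperator:
    """Stateful operator: rolling event counts keyed by user."""
    def __init__(self):
        self.counts = {}

    def process(self, user_id):
        self.counts[user_id] = self.counts.get(user_id, 0) + 1

    def snapshot(self):
        return copy.deepcopy(self.counts)    # consistent state snapshot

    def restore(self, snap):
        self.counts = copy.deepcopy(snap)

events = ["u1", "u2", "u1", "u3", "u1", "u2"]   # stream with offsets 0..5
op = CountingOperator()

# Checkpoint: snapshot operator state together with the stream offset.
# In Flink, a barrier flowing through the stream triggers this at the
# same logical position in every operator.
checkpoint = None
for offset, event in enumerate(events):
    if offset == 3:                              # barrier arrives here
        checkpoint = {"offset": offset, "state": op.snapshot()}
    op.process(event)
    if offset == 4:
        break                                    # simulated task failure

# Recovery: restore state, then replay from the checkpointed offset.
op.restore(checkpoint["state"])
for event in events[checkpoint["offset"]:]:
    op.process(event)

assert op.counts == {"u1": 3, "u2": 2, "u3": 1}  # each event counted once
print(op.counts)
```

Because the state snapshot and the replay offset were captured at the same logical position, every event lands in the counts exactly once despite the failure mid-stream.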
Sinks must support idempotent writes or transactions. An idempotent sink, such as an upsert to a key-value store (write user_42 count = 10), produces the same result if retried: replaying after a checkpoint writes the same value again. A transactional sink, such as a two-phase commit to a database, coordinates with checkpoints to commit writes atomically only after the checkpoint completes. Non-idempotent sinks such as append-only logs or increment operations (count += 1) break exactly-once: replays cause duplicates.
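A small sketch of why the sink choice matters under replay, with plain dicts standing in for Cassandra or Redis; the upsert and increment helpers are illustrative, not a real client API:

```python
def run(sink, store, events):
    """Recompute rolling counts from restored (empty) operator state and
    push every update through the sink."""
    counts = {}
    for user in events:
        counts[user] = counts.get(user, 0) + 1
        sink(store, user, counts[user])

def upsert(store, user, value):
    """Idempotent: write the count value itself (user_42 count = 10 style)."""
    store[user] = value

def increment(store, user, value):
    """Non-idempotent: count += 1 inside the sink; replays accumulate."""
    store[user] = store.get(user, 0) + 1

events = ["u1", "u2", "u1"]

# A failure after the first attempt forces a full replay from offset 0;
# operator state is restored too, so the exact same updates are re-emitted.
upsert_store, incr_store = {}, {}
for attempt in range(2):             # attempt 0, then the replay
    run(upsert, upsert_store, events)
    run(increment, incr_store, events)

print(upsert_store)  # {'u1': 2, 'u2': 1}  correct despite the replay
print(incr_store)    # {'u1': 4, 'u2': 2}  double-counted by the increment sink
```

The upsert sink converges to the true counts no matter how many times the updates are replayed; the increment sink bakes every duplicate delivery into the stored feature.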
The cost is throughput overhead (5 to 15% from checkpoint pauses and coordination) and increased failure sensitivity. If checkpoint storage is slow or tasks are unbalanced, a checkpoint may take longer than the interval, causing checkpoint timeouts and repeated failures. At-least-once mode is simpler and faster, but it requires downstream deduplication or tolerance for approximate features (slight overcounts acceptable).
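The sizing trade-off is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below reuses this section's figures (500 GB of state, 20% change rate per interval) and assumes roughly 1.7 GB per second of aggregate bandwidth to checkpoint storage, which reproduces the ~5 minute full / under 1 minute incremental numbers cited in the takeaways:

```python
def checkpoint_seconds(state_gb, bandwidth_gb_s, change_rate=1.0):
    """Time to write one checkpoint: full (change_rate=1) or incremental."""
    return state_gb * change_rate / bandwidth_gb_s

STATE_GB = 500
BANDWIDTH_GB_S = 1.7   # assumed aggregate write bandwidth to remote storage
INTERVAL_S = 60        # top of the 10-60 second range from this section

full = checkpoint_seconds(STATE_GB, BANDWIDTH_GB_S)                   # ~294 s
incr = checkpoint_seconds(STATE_GB, BANDWIDTH_GB_S, change_rate=0.2)  # ~59 s

# A checkpoint that outlasts the interval times out and fails repeatedly;
# incremental checkpoints restore the safety margin.
for label, t in [("full", full), ("incremental", incr)]:
    status = "OK" if t < INTERVAL_S else "TIMEOUT"
    print(f"{label}: {t:.0f}s vs {INTERVAL_S}s interval -> {status}")
```

On these assumed numbers a full checkpoint blows through even a 60-second interval, which is exactly the failure mode described above; the incremental checkpoint squeaks in under it.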
💡 Key Takeaways
•Exactly-once requires coordinated checkpoints every 10 to 60 seconds plus idempotent or transactional sinks. Without both, failures cause duplicate or missing feature updates that degrade model accuracy
•Checkpoint barriers injected into streams ensure consistent global snapshots: all operators checkpoint at the same logical stream position. With 500 GB of state and a 30-second interval, a full checkpoint needs roughly 17 GB per second of aggregate bandwidth to remote storage (about 130 MB per second per task at parallelism 128)
•Idempotent sinks like key-value upserts (user_42 count = 10) produce the same result on replay. Non-idempotent sinks like append or increment (count += 1) break exactly-once, causing duplicates on recovery
•Exactly-once adds 5 to 15% throughput overhead from checkpoint coordination and pauses. At-least-once is simpler and faster but requires downstream dedup or tolerance for approximate features
•Incremental checkpoints write only changed state, reducing checkpoint time from about 5 minutes to under 1 minute for 500 GB of state with a 20% change rate per interval
•Transactional sinks use two-phase commit: pre-commit when the checkpoint barrier reaches the sink, commit only after the checkpoint completes. Flink's transactional Kafka producer works this way: writes become visible to consumers only after the checkpoint finishes (see the sketch below)
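A minimal sketch of the two-phase commit protocol from the last takeaway, modeled loosely on the shape of Flink's TwoPhaseCommitSinkFunction; the class and method names here are illustrative, not the real API:

```python
class TwoPhaseCommitSink:
    """Transactional sink: writes are staged per checkpoint transaction
    and become visible only after the checkpoint completes."""

    def __init__(self):
        self.visible = {}     # committed state, what readers see
        self.pending = {}     # txn_id -> writes buffered in an open txn
        self.staged = set()   # txn_ids that survived pre-commit

    def write(self, txn_id, key, value):
        # Normal processing: writes go into the open transaction.
        self.pending.setdefault(txn_id, {})[key] = value

    def pre_commit(self, txn_id):
        # Checkpoint barrier reached the sink: flush and durably stage
        # the transaction, but do not make it visible yet.
        self.staged.add(txn_id)

    def commit(self, txn_id):
        # Checkpoint-complete notification: atomically publish the writes.
        assert txn_id in self.staged, "commit before pre-commit"
        self.visible.update(self.pending.pop(txn_id))
        self.staged.discard(txn_id)

    def abort(self, txn_id):
        # Failure before commit: discard staged writes so the replay from
        # the checkpoint cannot produce duplicates.
        self.pending.pop(txn_id, None)
        self.staged.discard(txn_id)

sink = TwoPhaseCommitSink()
sink.write(txn_id=1, key="user_42_count", value=10)
assert "user_42_count" not in sink.visible   # invisible mid-transaction
sink.pre_commit(1)                           # barrier passes through the sink
sink.commit(1)                               # checkpoint completed
assert sink.visible == {"user_42_count": 10}
```

If the pipeline fails between pre_commit and commit, recovery either re-commits the staged transaction or aborts it and replays from the checkpoint; either way the writes land in the visible state exactly once.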
📌 Examples
Uber trip features: Exactly-once pipeline writes per-driver rolling features to Cassandra with upserts (idempotent). Checkpoints every 30 seconds. Recovery replays from the last checkpoint without duplicate counts. Overhead: 8% throughput vs at-least-once.
LinkedIn activity counts: At-least-once pipeline for approximate engagement counts. Accepts 0.1% overcount from duplicates after failures. Simpler ops, 15% higher throughput, no checkpoint coordination overhead.
Netflix viewing aggregates: Exactly-once with incremental checkpoints to S3. State: 400 GB. Full checkpoint: 4 minutes. Incremental: 50 seconds (15% state change per interval). Recovery time: 2 minutes. Transactional writes to Delta Lake ensure atomic commits.
Airbnb booking features: A pipeline failure during peak traffic caused 2x click counts under at-least-once mode, biasing click-through rate (CTR) model training by 5%. Switching to exactly-once with idempotent Redis upserts cost 10% throughput and eliminated the training data corruption.