Idempotency, Deduplication, and Exactly-Once Illusions in Distributed Pipelines
Exactly-once delivery across distributed systems is an illusion. Without idempotent sinks and deterministic identifiers, retries and rebalancing introduce duplicates or gaps. The key is designing pipelines where reprocessing the same input produces the same output without additional side effects.
Attach deterministic primary keys and sequence numbers to every event. For Change Data Capture (CDC), use source table primary key plus Log Sequence Number (LSN). For application events, combine a globally unique event identifier with a monotonic timestamp or sequence per producer. Downstream sinks perform upserts or merges keyed by these identifiers, dropping changes with older sequence numbers. This makes repeated delivery safe: applying the same change twice has no additional effect.
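A minimal sketch of this upsert-and-drop rule, using an in-memory dict as a stand-in for the sink (`ChangeEvent` and `apply_change` are illustrative names, not a specific framework's API):

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    pk: str        # source table primary key (e.g., order_id)
    seq: int       # monotonic sequence (e.g., the CDC log sequence number)
    payload: dict  # column values after the change

# In-memory stand-in for the sink's state: pk -> (latest seq, payload).
# A real sink would run an upsert/merge into a table keyed the same way.
state: dict[str, tuple[int, dict]] = {}

def apply_change(event: ChangeEvent) -> bool:
    """Apply a change iff its sequence is newer than what the sink holds.

    Returns True if applied, False if dropped. Applying the same event
    twice is a no-op, so redelivery on retry is safe.
    """
    current = state.get(event.pk)
    if current is not None and event.seq <= current[0]:
        return False  # duplicate or out-of-order older change: drop it
    state[event.pk] = (event.seq, event.payload)
    return True

assert apply_change(ChangeEvent("order-12345", 5001, {"status": "shipped"}))
assert not apply_change(ChangeEvent("order-12345", 5001, {"status": "shipped"}))  # dup
assert not apply_change(ChangeEvent("order-12345", 5000, {"status": "packed"}))   # stale
assert apply_change(ChangeEvent("order-12345", 5002, {"status": "delivered"}))
```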
For streaming systems, maintain a bounded deduplication cache per partition keyed by (event ID, sequence). Size the cache using the arrival rate and the maximum expected reordering window. If p99 reordering is 5 minutes and you receive 1,000 events per second, cache the last 300,000 event IDs per partition (300 seconds × 1,000 events/second). Expire cache entries once they fall behind the watermark by more than the reordering window plus a safety buffer. This prevents memory exhaustion while catching duplicates within the reordering window.
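A sketch of such a bounded cache, assuming events carry an ID, a sequence number, and an event-time timestamp; the sizing follows the numbers above, but the class and its API are hypothetical:

```python
from collections import OrderedDict

class DedupCache:
    """Bounded per-partition dedup cache keyed by (event_id, seq)."""

    def __init__(self, max_entries: int, ttl_seconds: float):
        self.max_entries = max_entries    # arrival rate × reorder window
        self.ttl_seconds = ttl_seconds    # reorder window + safety buffer
        self._seen: OrderedDict[tuple, float] = OrderedDict()  # key -> event time

    def is_duplicate(self, event_id: str, seq: int,
                     event_time: float, watermark: float) -> bool:
        key = (event_id, seq)
        if key in self._seen:
            return True
        self._seen[key] = event_time
        self._expire(watermark)
        return False

    def _expire(self, watermark: float) -> None:
        # Evict in arrival order: first anything behind the watermark by more
        # than the TTL, then enough entries to honor the hard size bound.
        cutoff = watermark - self.ttl_seconds
        while self._seen:
            oldest_time = next(iter(self._seen.values()))
            if oldest_time >= cutoff and len(self._seen) <= self.max_entries:
                break
            self._seen.popitem(last=False)  # drop oldest-arrived entry

# Sizing from the text: 1,000 events/s, 5-minute p99 reorder window.
cache = DedupCache(max_entries=300_000, ttl_seconds=5 * 60 + 60)
assert not cache.is_duplicate("evt-1", 1, event_time=100.0, watermark=100.0)
assert cache.is_duplicate("evt-1", 1, event_time=100.0, watermark=100.0)
```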
Cross-system exactly-once is harder. A producer may write to a stream successfully but crash before committing the write offset, causing a duplicate on retry. Transactional frameworks coordinate stream writes and offset commits atomically, but not all sinks support this. The pragmatic approach is idempotent sinks plus monitoring for duplicate rates and reconciliation jobs that check row counts and checksums between source and destination.
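One way such a reconciliation job can work is an order-insensitive row count and checksum computed on each side; the helpers below are hypothetical stand-ins for aggregations you would normally push down into the source and destination stores:

```python
import hashlib

def table_fingerprint(rows) -> tuple[int, str]:
    """Order-insensitive row count and checksum over a table snapshot."""
    count, digest = 0, 0
    for row in rows:
        count += 1
        # XOR of per-row hashes is commutative, so row order doesn't matter.
        h = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
        digest ^= int.from_bytes(h[:8], "big")
    return count, format(digest, "016x")

def reconcile(source_rows, dest_rows) -> bool:
    src, dst = table_fingerprint(source_rows), table_fingerprint(dest_rows)
    if src != dst:
        # In production: alert, then diff by key range to locate the drift.
        print(f"mismatch: source={src} dest={dst}")
        return False
    return True

# A duplicated row shifts the count (and usually the checksum); note that an
# XOR checksum alone cancels *pairs* of identical rows, so check count too.
src = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
dst = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 2, "v": "b"}]
assert not reconcile(src, dst)
```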
Amazon pipelines enforce idempotency at every stage. Ingestion assigns deterministic keys; transformation jobs checkpoint source offsets and watermark state; sinks upsert by key. Backfill jobs reprocess overlapping time windows safely because all operations are idempotent. Monitoring tracks duplicate event rates and alerts if they exceed thresholds, triggering investigation of producer bugs or misconfigured retries.
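A compact sketch of why checkpointed offsets plus upsert sinks make overlapping backfills safe; all classes are in-memory stand-ins, not Amazon's or any framework's actual API:

```python
from dataclasses import dataclass

@dataclass
class Event:
    offset: int   # position in the source log
    pk: str       # deterministic key assigned at ingestion
    seq: int      # monotonic sequence per key
    payload: dict

class UpsertSink:
    """Sink that upserts by key and drops older sequences."""
    def __init__(self):
        self.rows: dict[str, tuple[int, dict]] = {}
    def upsert(self, ev: Event) -> None:
        cur = self.rows.get(ev.pk)
        if cur is None or ev.seq > cur[0]:
            self.rows[ev.pk] = (ev.seq, ev.payload)

def run_window(events, sink: UpsertSink, checkpoint: int) -> int:
    """Process events from `checkpoint` on; return the new checkpoint."""
    for ev in events:
        if ev.offset < checkpoint:
            continue
        sink.upsert(ev)
        checkpoint = ev.offset + 1
    # Persist the checkpoint only after the sink writes are durable: a crash
    # before this point replays the window, and the upserts absorb the replay.
    return checkpoint

sink = UpsertSink()
log = [Event(0, "k1", 1, {"v": "a"}), Event(1, "k1", 2, {"v": "b"})]
run_window(log, sink, 0)
run_window(log, sink, 0)  # backfill over the same window: state unchanged
assert sink.rows == {"k1": (2, {"v": "b"})}
```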
💡 Key Takeaways
• Exactly-once delivery across systems is not guaranteed. Design for idempotency: reprocessing the same input must produce the same output without additional side effects.
• Attach deterministic primary keys and monotonic sequence numbers (e.g., CDC: source PK + LSN, events: UUID + timestamp). Sinks upsert by key and drop older sequences.
• For streaming, maintain a bounded dedup cache per partition sized by arrival rate and reordering window. Example: 1k events/s, 5 min reorder window = cache last 300k IDs per partition.
• Cross-system exactly-once requires transactional coordination (atomic write + offset commit). Pragmatic fallback: idempotent sinks plus reconciliation jobs checking row counts and checksums.
• Amazon pattern: deterministic keys at ingress, checkpointed offsets and watermarks in jobs, upsert sinks. Backfills reprocess overlapping windows safely. Monitor duplicate rates and alert on anomalies.
📌 Examples
CDC idempotency: event carries (order_id=12345, LSN=5001, op=UPDATE). Sink sees LSN=5001 already applied, drops duplicate. Later change with LSN=5002 is applied; out-of-order LSN=5000 is dropped.
Dedup cache sizing: at 2k events/s with p99 reorder of 10 minutes, cache 2000 × 600 = 1.2M event IDs per partition. Expire entries older than watermark + 15 min safety buffer to prevent memory bloat.