
How Idempotency Works: Keys and Upserts

The Mechanism: Idempotency relies on two core techniques working together: stable identifiers that uniquely represent business events, and upsert semantics at the sink that overwrite rather than append.
1. Producers attach stable keys: Every event gets an identifier that stays constant across retries. For an order event, this might be <code>order_id</code> plus <code>line_item_id</code>. For a user action, <code>user_id</code> plus <code>event_type</code> plus the event timestamp.
2. Sinks use upsert logic: Instead of a plain INSERT, the sink performs INSERT ... ON CONFLICT ... DO UPDATE or MERGE. If a row with that key exists, overwrite it; if not, insert it. Either way, running twice produces the same final row (see the sketch after this list).
3. State reflects business reality: The database or data warehouse now represents each business event exactly once, regardless of how many times the pipeline ran.
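To make these steps concrete, here is a minimal sketch in Python using SQLite's upsert syntax; the <code>events</code> table, <code>record_event</code> function, and column names are illustrative assumptions, not tied to any particular platform:

```python
import sqlite3

# The stable business key (event_id) is the primary key of the sink table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, clicks INTEGER)")

def record_event(event_id: str, clicks: int) -> None:
    # Upsert: insert the row, or overwrite it if the key already exists.
    conn.execute(
        "INSERT INTO events (event_id, clicks) VALUES (?, ?) "
        "ON CONFLICT (event_id) DO UPDATE SET clicks = excluded.clicks",
        (event_id, clicks),
    )

record_event("order-42:line-1", 3)
record_event("order-42:line-1", 3)  # retry delivers the same event again
assert conn.execute("SELECT COUNT(*) FROM events").fetchone()[0] == 1
```

However many times the same event is delivered, the table converges to the same single row, which is exactly the property retries and replays depend on.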
Real Numbers at Scale: Consider Stripe's payment processing. When a client retries a payment API call after a network timeout, it includes an <code>idempotency_key</code>. The backend stores results keyed by this value, so a second request with the same key returns the cached result instead of charging twice. This handles billions of API requests per year with zero duplicate charges.

Data pipelines use the same principle. An ads analytics platform ingesting 200,000 events per second might see 10,000 to 20,000 duplicates per second from client retries. With <code>event_id</code> based deduplication, those duplicates are safely ignored or overwritten, keeping click counts accurate.

For Aggregates: When building aggregates like daily revenue per merchant, the idempotency key becomes a composite: <code>merchant_id</code> plus <code>date</code>. A daily batch job can be rerun to fix bugs or absorb late data: it fully recomputes revenue for that merchant and date, then upserts the result, replacing the previous value rather than adding to it. A sketch of this recompute-and-upsert pattern follows.
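A minimal sketch of that recompute-and-upsert pattern, again in Python with SQLite (the <code>orders</code> and <code>daily_revenue</code> tables and their columns are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (merchant_id TEXT, order_date TEXT, amount REAL)")
conn.execute(
    "CREATE TABLE daily_revenue ("
    "merchant_id TEXT, order_date TEXT, revenue REAL, "
    "PRIMARY KEY (merchant_id, order_date))"  # composite idempotency key
)

def recompute_daily_revenue(merchant_id: str, order_date: str) -> None:
    # Fully recompute the bucket from source data, then upsert the result,
    # so a rerun replaces the previous value instead of adding to it.
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders "
        "WHERE merchant_id = ? AND order_date = ?",
        (merchant_id, order_date),
    ).fetchone()
    conn.execute(
        "INSERT INTO daily_revenue (merchant_id, order_date, revenue) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (merchant_id, order_date) DO UPDATE SET revenue = excluded.revenue",
        (merchant_id, order_date, total),
    )

conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("m1", "2024-01-01", 30.0), ("m1", "2024-01-01", 20.0)],
)
recompute_daily_revenue("m1", "2024-01-01")
recompute_daily_revenue("m1", "2024-01-01")  # rerun: replaces, does not add
assert conn.execute("SELECT revenue FROM daily_revenue").fetchone()[0] == 50.0
```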
⚠️ Common Pitfall: Using auto-incrementing IDs or processing timestamps as your idempotency key breaks idempotency entirely. These values change on every run, so reprocessing creates new rows instead of updating existing ones.
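One way to avoid this pitfall is to derive the key deterministically from business fields, so every attempt produces the same value. A minimal sketch (the field names and the namespace UUID are arbitrary illustrative choices):

```python
import uuid

# Fixed, arbitrary namespace so identical business fields always map to the
# same key (illustrative constant, not a real product value).
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")

def stable_event_id(order_id: str, line_item_id: str) -> str:
    # Deterministic: the same inputs yield the same key on every retry or replay.
    return str(uuid.uuid5(NAMESPACE, f"{order_id}:{line_item_id}"))

assert stable_event_id("42", "1") == stable_event_id("42", "1")
```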
💡 Key Takeaways
Stable identifiers like <code>order_id</code> or <code>event_id</code> must stay constant across retries and replays, not change per attempt
Upsert semantics (INSERT ON CONFLICT UPDATE, MERGE) ensure reprocessing overwrites existing rows instead of creating duplicates
For aggregates, composite keys like <code>merchant_id</code> plus <code>date</code> enable safe recomputation of time-bucketed metrics
At 200,000 events per second with a 5% duplication rate, idempotent processing handles 10,000 duplicate events per second transparently
Keys must represent business identity, not physical artifacts like auto-incrementing IDs or processing timestamps
📌 Examples
1. Postgres upsert: <code>INSERT INTO orders (order_id, amount) VALUES (12345, 50) ON CONFLICT (order_id) DO UPDATE SET amount = EXCLUDED.amount</code>
2. Spark Delta Lake merge (Python API): <code>deltaTable.alias('target').merge(updates.alias('source'), 'target.event_id = source.event_id').whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()</code>
3. Stripe API idempotency: The client sends an <code>Idempotency-Key: abc123</code> header. The server caches the response by this key, so a retry with the same key returns the cached result, preventing a double charge.
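To illustrate the server side of that pattern, here is a minimal, generic sketch of an idempotency-key cache (not Stripe's actual implementation; the <code>handle_charge</code> function and in-memory dict are assumptions, and a real service would use a durable store and handle concurrent requests atomically):

```python
from typing import Callable, Dict

# Responses cached by idempotency key (in-memory for illustration only).
_responses: Dict[str, dict] = {}

def handle_charge(idempotency_key: str, charge: Callable[[], dict]) -> dict:
    # First request with this key runs the charge and caches the response;
    # any retry with the same key returns the cached response unchanged.
    if idempotency_key not in _responses:
        _responses[idempotency_key] = charge()
    return _responses[idempotency_key]

first = handle_charge("abc123", lambda: {"status": "charged", "amount": 50})
retry = handle_charge("abc123", lambda: {"status": "charged", "amount": 50})
assert first is retry  # the retry got the cached result: no double charge
```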