How Idempotency Works: Keys and Upserts
The Mechanism:
Idempotency relies on two core techniques working together: stable identifiers that uniquely represent business events, and upsert semantics at the sink that overwrite rather than append.
1. Producers attach stable keys: Every event gets an identifier that stays constant across retries. For an order event, this might be order_id plus line_item_id. For a user action, user_id plus event_type plus timestamp.
2. Sinks use upsert logic: Instead of INSERT, the sink does INSERT ON CONFLICT UPDATE or MERGE. If a row with that key exists, overwrite it. If not, insert new. Either way, running twice produces the same final row.
3. State reflects business reality: The database or data warehouse now represents each business event exactly once, regardless of how many times the pipeline ran.
Real Numbers at Scale:
Consider Stripe's payment processing. When a client retries a payment API call due to a network timeout, it includes an idempotency_key. The backend stores results keyed by this value. A second request with the same key returns the cached result instead of charging twice. This handles billions of API requests per year with zero duplicate charges.
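A minimal sketch of that request-level pattern, assuming an in-memory dict stands in for the real result store and charge_card is a hypothetical payment function used only for illustration:

# Sketch of idempotency-key handling; the dict stands in for a durable
# result store, and charge_card is a hypothetical side-effecting call.
results_by_key = {}  # idempotency_key -> previously computed response

def charge_card(amount_cents):
    # Placeholder for the real side effect (charging the customer once).
    return {"status": "charged", "amount_cents": amount_cents}

def handle_payment(idempotency_key, amount_cents):
    # A retry with the same key returns the cached result instead of
    # charging the card a second time.
    if idempotency_key in results_by_key:
        return results_by_key[idempotency_key]
    response = charge_card(amount_cents)
    results_by_key[idempotency_key] = response
    return response

# Two calls with the same key produce one charge and identical responses.
first = handle_payment("abc123", 5000)
second = handle_payment("abc123", 5000)
assert first == second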
Data pipelines use the same principle. An ads analytics platform ingesting 200,000 events per second might see 10,000 to 20,000 duplicates per second from client retries. With event_id-based deduplication, those duplicates are safely ignored or overwritten, keeping click counts accurate.
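A sketch of that deduplication step, assuming events are plain dicts carrying an event_id and that an in-memory set stands in for whatever dedup store the pipeline actually uses:

# Sketch of event_id-based deduplication; the set stands in for a real
# dedup store (keyed state, a key-value cache, or the sink's unique key).
def dedupe(events):
    seen_ids = set()
    for event in events:
        if event["event_id"] in seen_ids:
            continue  # duplicate from a client retry: safely ignored
        seen_ids.add(event["event_id"])
        yield event

events = [
    {"event_id": "click-1", "campaign": "summer"},
    {"event_id": "click-1", "campaign": "summer"},  # retried duplicate
    {"event_id": "click-2", "campaign": "summer"},
]
unique = list(dedupe(events))
assert len(unique) == 2  # click counts stay accurate despite the retry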
For Aggregates:
When building aggregates like daily revenue per merchant, the idempotency key becomes a composite: merchant_id plus date. A daily batch job can be rerun to fix bugs or handle late data. It fully recomputes revenue for that merchant and date, then upserts the result. The previous value is replaced, not added to.
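A sketch of that rerunnable daily job, using SQLite's ON CONFLICT upsert as a stand-in for the warehouse MERGE (the syntax mirrors the Postgres example below); table and column names are illustrative:

import sqlite3

# Rerunnable daily aggregate keyed by (merchant_id, revenue_date).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, merchant_id TEXT, order_date TEXT, amount REAL)")
conn.execute(
    "CREATE TABLE daily_revenue (merchant_id TEXT, revenue_date TEXT, revenue REAL, "
    "PRIMARY KEY (merchant_id, revenue_date))"
)

def recompute_daily_revenue(merchant_id, revenue_date):
    # Fully recompute revenue for one merchant and date, then upsert the
    # result; the previous value is replaced, not added to.
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE merchant_id = ? AND order_date = ?",
        (merchant_id, revenue_date),
    ).fetchone()
    conn.execute(
        "INSERT INTO daily_revenue (merchant_id, revenue_date, revenue) VALUES (?, ?, ?) "
        "ON CONFLICT (merchant_id, revenue_date) DO UPDATE SET revenue = excluded.revenue",
        (merchant_id, revenue_date, total),
    )

conn.execute("INSERT INTO orders VALUES ('o-1', 'm-42', '2024-06-01', 30.0)")
conn.execute("INSERT INTO orders VALUES ('o-2', 'm-42', '2024-06-01', 20.0)")

# Running the job twice leaves the same single row for m-42 on 2024-06-01.
recompute_daily_revenue("m-42", "2024-06-01")
recompute_daily_revenue("m-42", "2024-06-01")
assert conn.execute("SELECT revenue FROM daily_revenue").fetchall() == [(50.0,)]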
⚠️ Common Pitfall: Using auto-incrementing IDs or processing timestamps as your idempotency key breaks idempotency. These values change on every run, so reprocessing creates new rows instead of updating existing ones.
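One way to see the difference, assuming a hypothetical helper that derives the key purely from business fields rather than from anything generated per run:

import hashlib
import time
import uuid

# make_stable_key is a hypothetical helper: it derives the key only from
# business fields, so every rerun produces the same key for the same event.
def make_stable_key(order_id, line_item_id):
    raw = f"{order_id}:{line_item_id}"
    return hashlib.sha256(raw.encode()).hexdigest()

def make_unstable_key():
    # Changes on every run, so reprocessing creates new rows instead of
    # updating existing ones.
    return f"{uuid.uuid4()}-{time.time()}"

assert make_stable_key("order-123", "item-7") == make_stable_key("order-123", "item-7")
assert make_unstable_key() != make_unstable_key()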
💡 Key Takeaways
✓ Stable identifiers like order_id or event_id must stay constant across retries and replays, not change per attempt
✓ Upsert semantics (INSERT ON CONFLICT UPDATE, MERGE) ensure reprocessing overwrites existing rows instead of creating duplicates
✓ For aggregates, composite keys like merchant_id plus date enable safe recomputation of time-bucketed metrics
✓ At 200,000 events per second with a 5% duplication rate, idempotent processing handles 10,000 duplicate events per second transparently
✓ Keys must represent business identity, not physical artifacts like auto-incrementing IDs or processing timestamps
📌 Examples
1. Postgres upsert: INSERT INTO orders (order_id, amount) VALUES (12345, 50) ON CONFLICT (order_id) DO UPDATE SET amount = EXCLUDED.amount
2. Spark Delta Lake merge: deltaTable.alias('target').merge(updates.alias('source'), 'target.event_id = source.event_id').whenMatched().updateAll().whenNotMatched().insertAll()
3. Stripe API idempotency: Client sends an Idempotency-Key: abc123 header. Server caches the response by this key. A retry with the same key returns the cached result, preventing a double charge.