
Failure Modes and Edge Cases in Idempotent Pipelines

Late and Out-of-Order Data: The most common failure mode is events arriving outside your deduplication window. Imagine you deduplicate based on a 1-hour window of seen event_id values. A mobile client goes offline for 3 hours, then sends its buffered events when it reconnects. Those events fall outside your dedupe window and will be counted again if the pipeline replays. This isn't theoretical: at companies with mobile apps used in areas with poor connectivity, 2 to 5% of events can be delayed by hours or days. You must define acceptable lateness and dedupe state retention explicitly. For example, maintain 24 hours of dedupe keys in memory or fast storage, and accept that events delayed beyond 24 hours might duplicate if a replay happens.
Late Data Impact (chart): 95% of events arrive on time, 4% arrive 1 to 4 hours late, and 1% arrive beyond the dedupe window.
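To make the retention decision concrete, here is a minimal sketch of a dedupe check with an explicit retention window, assuming an in-process store; a production pipeline would typically keep this state in Redis, RocksDB, or the stream processor's keyed state, but the trade-off is the same.
```python
import time

# Illustrative in-memory dedupe store with an explicit 24-hour retention window.
DEDUPE_RETENTION_SECONDS = 24 * 60 * 60

_seen = {}  # event_id -> first-seen timestamp

def is_duplicate(event_id: str) -> bool:
    now = time.time()
    # Evict keys older than the retention window so state stays bounded.
    for key in [k for k, ts in _seen.items() if now - ts > DEDUPE_RETENTION_SECONDS]:
        del _seen[key]
    if event_id in _seen:
        return True
    _seen[event_id] = now
    return False
```
Any event delayed past the retention window is invisible to this check, which is exactly the failure mode described above; widening the window trades state size for correctness.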
Key Collision and Non-Unique Keys: A subtle but devastating failure is when your idempotency key is not actually unique. Suppose you use user_id plus timestamp as the key, and the timestamp has only second precision. Two events from the same user in the same second will collide, and one will silently overwrite the other. Real production incident: a team used session_id as the idempotency key, assuming it was unique per session. It turned out the mobile SDK reused session IDs after an app restart under certain conditions. Different logical sessions had the same session_id, causing one session's data to overwrite another's: silent data loss that took weeks to discover. The fix: use composite keys with truly unique components, for example producer_id plus monotonic_sequence_number, or a proper UUID (version 4). Never use timestamps alone or client-generated values without validation.
Partial Failures Across Multiple Sinks: A streaming job writes to both a Postgres database and a Redis cache. The Postgres write succeeds, but Redis times out. The job crashes, restarts, and reprocesses the batch. Now Postgres receives an idempotent upsert (fine), but Redis gets a duplicate write because it's not using the same key-based upsert logic.
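As a rough sketch of both fixes, assuming pg_conn is a psycopg2 connection, redis_client is a redis-py client, and the events table is hypothetical: the composite key below stays unique even when timestamps collide, and the same key drives both the Postgres upsert and the Redis write, so a replayed batch overwrites rather than duplicates.
```python
import json

def idempotency_key(producer_id: str, sequence_number: int) -> str:
    # Composite key: producer identity plus a monotonic sequence number is
    # unique even when two events share a second-precision timestamp.
    return f"{producer_id}:{sequence_number}"

def write_event(pg_conn, redis_client, event: dict) -> None:
    # pg_conn: psycopg2 connection, redis_client: redis-py client (assumptions).
    key = idempotency_key(event["producer_id"], event["sequence_number"])

    # Sink 1: Postgres upsert keyed on the idempotency key (hypothetical table).
    with pg_conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO events (idempotency_key, payload)
            VALUES (%s, %s)
            ON CONFLICT (idempotency_key) DO UPDATE SET payload = EXCLUDED.payload
            """,
            (key, json.dumps(event)),
        )
    pg_conn.commit()

    # Sink 2: Redis write keyed on the *same* idempotency key, so a replay
    # overwrites the cached value instead of producing a second copy.
    redis_client.set(f"event:{key}", json.dumps(event))
```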
❗ Remember: Idempotency must be consistent across all sinks in a transaction boundary. If you write to multiple systems, either make all of them idempotent with the same key, or use distributed transactions or the outbox pattern.
Another variation: a batch job writes to a data warehouse (idempotent upsert by order_id) and emits metrics to a monitoring system (simple counter increment). On retry, the warehouse is fine but the metrics are double-counted. You need to make the metrics system idempotent too, perhaps by including a unique job execution ID and deduplicating on the metrics ingestion side.
Backfills with Changed Logic: You fix a bug in your revenue calculation logic and backfill the last month of data. The pipeline is idempotent, so it safely overwrites the old data. But what if the new logic produces results that conflict with downstream assumptions? For example, the old logic computed revenue in cents and the new logic in dollars, so downstream ML models trained on the old scale break. Or the old logic included refunds in revenue and the new logic doesn't, but reports and dashboards were built expecting the old definition. This is idempotency at the data level but semantic non-idempotency. The fix involves schema versioning and migration strategies: add a schema_version or computation_version field to output tables so downstream systems can filter or adapt based on version. When backfilling with new logic, either write to a new versioned partition or ensure downstream systems handle both versions.
State Store Overload: At very high scale, stateful deduplication can become a bottleneck. If you store dedupe keys in a single shared Redis or RocksDB instance, a large replay or spike can overwhelm it: requests time out, triggering retries and creating a cascading failure. The incident pattern: a backfill job processes 10 hours of data, generating 7 billion events. The dedupe state store expects 200k events per second but suddenly receives 2 million per second. Query latency spikes from 2ms to 500ms, the pipeline times out and retries, and the problem compounds. Mitigation strategies: shard the dedupe state by key hash across multiple instances (see the sketch below), use approximate structures like Bloom filters for older time windows to reduce memory pressure, implement backpressure and rate limiting so replays don't overwhelm the system, and monitor state store latency and size as critical production metrics.
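One possible sketch of the sharding mitigation, assuming several independent dedupe store instances behind a thin client interface; the seen/add methods here are illustrative, not a specific library's API.
```python
import hashlib

class ShardedDedupeStore:
    """Spread dedupe keys across multiple store instances by key hash,
    so a replay spike is absorbed by all shards instead of a single node."""

    def __init__(self, shard_clients):
        # shard_clients: objects exposing seen(key) -> bool and add(key),
        # e.g. thin wrappers around separate Redis nodes (illustrative interface).
        self.shards = shard_clients

    def _shard_for(self, key: str):
        # Stable hash so a given key always routes to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest[:8], "big") % len(self.shards)]

    def is_duplicate(self, key: str) -> bool:
        shard = self._shard_for(key)
        if shard.seen(key):
            return True
        shard.add(key)
        return False
```
Older windows can be demoted to a per-shard Bloom filter to cap memory, at the cost of occasionally treating a never-seen event as a duplicate.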
💡 Key Takeaways
Late data beyond your dedupe window (for example, 3 hours when you store 1 hour) can bypass deduplication and create duplicates on replay
Key collisions from non-unique identifiers (like second-precision timestamps) cause silent data loss as one event overwrites another
Partial failures across multiple sinks break idempotency unless all sinks use consistent key-based upserts or transactional writes
Backfills with changed computation logic are idempotent at the data level but can break downstream systems expecting old semantics
Stateful dedupe stores can become bottlenecks during replays: a 7-billion-event backfill can overwhelm a state store sized for a sustained 200k events/sec
📌 Examples
1. Late data incident: A mobile app buffers events for 6 hours offline. The dedupe window is 2 hours. When the pipeline replays, those events duplicate because the state has been evicted. Fix: extend the window to 24 hours or use a persistent dedupe store.
2. Key collision bug: Using user_id + UNIX_TIMESTAMP(event_time) as the key. Two clicks in the same second collide, and the second event overwrites the first, losing a click. Fix: use user_id + event_id (UUID) instead.
3. Partial failure: A streaming job writes to Snowflake (idempotent) and Datadog metrics (append-only counter). The job retries after a Datadog timeout. The Snowflake data is correct, but Datadog shows 2x the actual event count. Fix: include job_run_id in metrics and dedupe on ingestion, as in the sketch below.
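A minimal sketch of the ingestion-side fix from example 3, using a hypothetical in-memory metrics endpoint; the point is that the dedupe key is (job_run_id, metric_name), so a retried job run cannot double-count.
```python
# Hypothetical ingestion-side dedupe for counter metrics: a retried job run
# re-sends the same (job_run_id, metric_name) pair, which is applied only once.
_applied = set()   # (job_run_id, metric_name) pairs already counted
_counters = {}     # metric_name -> running total

def ingest_metric(job_run_id: str, metric_name: str, value: float) -> None:
    dedupe_key = (job_run_id, metric_name)
    if dedupe_key in _applied:
        return  # duplicate delivery from a retried run; ignore
    _applied.add(dedupe_key)
    _counters[metric_name] = _counters.get(metric_name, 0.0) + value

ingest_metric("run-42", "events_processed", 1_000_000)
ingest_metric("run-42", "events_processed", 1_000_000)  # retry: no-op
```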