Failure Modes: Watermark Bugs and Duplicate Writes
Why Incremental Systems Fail Differently:
Full batch jobs fail obviously: the job crashes, no output is written, and you rerun everything. Incremental systems fail silently and insidiously. You lose a small percentage of data, or you double count metrics, and you do not notice until someone spots an anomaly in a query weeks later. Understanding these failure modes is critical for interview discussions and production reliability.
Failure Mode 1: Watermark Advancement Data Loss:
Consider a timestamp based incremental load. You query the source table for rows where updated_at is greater than your stored watermark, transform the data, write to your target, and then update the watermark to the current wall clock time.
The failure: your OLTP database has a 10 second replication lag between primary and replica. At 10:00:00 AM, you query the replica and see updates up to 9:59:50 AM. You process those rows and store the watermark as 10:00:00 AM. A legitimate update at 9:59:55 AM is still replicating and does not appear in your query. Your next run at 10:15:00 AM queries for rows after 10:00:00 AM, permanently skipping the 9:59:55 AM update.
Data loss timeline: 9:59:50 AM last row visible on the replica → 9:59:55 AM update still replicating → 10:00:00 AM watermark set.
The fix: always lag your watermark behind current time by a safe margin. Process rows where updated_at is less than (current time minus 30 minutes), giving ample buffer for replication lag, clock skew, and delayed writes. Monitor actual lag and alert if it exceeds your buffer.
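A minimal sketch of the lagged watermark pattern, assuming a Postgres-style source read through a DB-API connection; the table name (source_events) and the transform_and_write helper are hypothetical:

```python
from datetime import datetime, timedelta, timezone

LAG_BUFFER = timedelta(minutes=30)   # must exceed worst-case replication lag plus clock skew


def transform_and_write(rows):
    """Placeholder for the real transform and load step."""
    ...


def run_incremental_load(conn, last_watermark: datetime) -> datetime:
    """Process one incremental window and return the new watermark."""
    # Query up to a lagged upper bound, never up to "now": rows that are still
    # replicating (or written by a skewed clock) get picked up on a later run.
    upper_bound = datetime.now(timezone.utc) - LAG_BUFFER

    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, payload, updated_at
            FROM source_events               -- hypothetical source table
            WHERE updated_at > %s AND updated_at <= %s
            ORDER BY updated_at
            """,
            (last_watermark, upper_bound),
        )
        rows = cur.fetchall()

    transform_and_write(rows)

    # Advance the watermark only to the bound actually queried, never to wall clock time.
    return upper_bound
```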
Failure Mode 2: Duplicate Writes from Offset Commit Ordering:
Your Kafka consumer reads 10,000 events, transforms them, writes to your database, and then commits the new offset. The failure: the database write succeeds, but your consumer crashes before committing the offset. On restart, it reprocesses the same 10,000 events.
If your writes are not idempotent, you now have duplicates. An event counting pipeline increments a counter twice. A financial ledger records the same transaction twice. Metrics drift over time, and subtle bugs compound.
❗ Remember: The only safe pattern is exactly once semantics: either commit the offset and the write in a single distributed transaction, or make the writes idempotent using unique keys plus upsert logic.
Idempotency typically relies on natural business keys (transaction ID, event ID) plus a deterministic output schema. Insert or update based on the key: if the row already exists with identical values, the second write is a no op. Systems like Kafka Streams and Flink provide transactional output that commits offsets and writes atomically, but this requires careful configuration and compatible sinks.
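A sketch of the idempotent upsert variant, assuming a confluent-kafka consumer created with enable.auto.commit set to false and a Postgres sink; the table and field names are illustrative:

```python
import json

from confluent_kafka import Consumer   # consumer assumed configured with "enable.auto.commit": False

UPSERT_SQL = """
    INSERT INTO revenue_events (event_id, amount_cents, occurred_at)  -- hypothetical sink table
    VALUES (%s, %s, %s)
    ON CONFLICT (event_id) DO NOTHING  -- requires a unique constraint; replaying an event is a no op
"""


def process_batch(consumer: Consumer, pg_conn) -> None:
    msgs = consumer.consume(num_messages=10_000, timeout=5.0)
    if not msgs:
        return

    rows = []
    for msg in msgs:
        if msg.error():
            continue
        event = json.loads(msg.value())
        rows.append((event["event_id"], event["amount_cents"], event["occurred_at"]))

    with pg_conn.cursor() as cur:
        cur.executemany(UPSERT_SQL, rows)
    pg_conn.commit()                      # make the write durable first...

    consumer.commit(asynchronous=False)   # ...then advance the offset; a crash in between
                                          # only causes a harmless idempotent replay
```

Because the offset is committed only after the database transaction, a crash between the two steps replays the batch, and the upsert keyed on event_id turns that replay into a no op.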
Failure Mode 3: Schema Evolution Silent Data Loss:
Your source system adds a new column to events: payment_method. Your incremental pipeline infers schema from the first batch of data it sees. If it processes an old partition first, it builds a schema without payment_method. New events with that field arrive later, but the pipeline silently drops the unknown column because the schema is already locked.
Downstream analysts wonder why payment_method is always null in recent data. The fix: use schema registries (Avro, Protobuf) with explicit versioning, or enforce schema on write at the source. Validate that new fields appear in downstream tables within expected SLAs.
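A minimal illustration of enforcing an explicit schema rather than inferring one; in production the expected field set would come from a schema registry (Avro or Protobuf), and the field names here are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

# Expected fields for the current schema version. In a real pipeline this set
# would be fetched from a schema registry, not hard-coded.
EXPECTED_FIELDS_V2 = {"event_id", "user_id", "amount_cents", "payment_method"}


def validate_record(record: dict) -> dict:
    fields = set(record)

    missing = EXPECTED_FIELDS_V2 - fields
    if missing:
        # Fail loudly instead of writing nulls that analysts discover weeks later.
        raise ValueError(f"record {record.get('event_id')} is missing fields: {missing}")

    unknown = fields - EXPECTED_FIELDS_V2
    if unknown:
        # New upstream columns should trigger a schema review, not be silently dropped.
        logger.warning("schema drift: unexpected fields %s in record %s",
                       unknown, record.get("event_id"))

    return record
```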
Failure Mode 4: Unbounded State in Stateful Streaming:
A streaming job tracks active user sessions to compute session duration. It maintains state for every active session ID. If sessions are long lived (days or weeks) or if bugs cause sessions to never close, state grows unbounded.
At 1 million concurrent sessions with 10 KB state each, you accumulate 10 GB. At 100 million sessions (due to a bug not expiring old sessions), you hit 1 TB and run out of memory. The job crashes, and restarting with corrupted or oversized checkpoints becomes a multi hour recovery effort.
The fix: always use time based windowing or TTL (Time To Live) policies on state. Expire session state after 24 hours regardless of activity. Monitor state size as a first class metric and alert when it grows unexpectedly.
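A toy sketch of TTL-bounded session state; real jobs would rely on the state TTL facilities of their streaming framework, but the bounding and monitoring idea is the same:

```python
import time
from typing import Optional

SESSION_TTL_SECONDS = 24 * 60 * 60        # expire session state after 24 hours, active or not


class SessionStore:
    """Toy in-memory session state with TTL eviction."""

    def __init__(self) -> None:
        self._sessions: dict[str, tuple[float, dict]] = {}   # session_id -> (last_event_ts, state)

    def update(self, session_id: str, event_ts: float, state: dict) -> None:
        self._sessions[session_id] = (event_ts, state)

    def evict_expired(self, now: Optional[float] = None) -> int:
        """Drop sessions idle longer than the TTL; run this on a timer or per batch."""
        now = time.time() if now is None else now
        expired = [sid for sid, (ts, _) in self._sessions.items()
                   if now - ts > SESSION_TTL_SECONDS]
        for sid in expired:
            del self._sessions[sid]
        return len(expired)

    def size(self) -> int:
        # Export this as a first-class metric and alert when it grows unexpectedly.
        return len(self._sessions)
```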
💡 Key Takeaways
✓ Watermark advancement without lag buffers causes permanent data loss: process up to (current time minus 30 minutes) to handle replication lag and clock skew, not current time
✓ Duplicate writes occur when offset commits happen before writes are durable: use exactly once semantics (transactional commits) or idempotent writes with unique keys and upsert logic
✓ Schema evolution silently drops new columns if pipelines infer schema from old data: use schema registries with explicit versioning and monitor for new fields in downstream tables
✓ Unbounded state in streaming jobs (tracking millions of sessions without TTL) leads to out of memory crashes and corrupted checkpoints requiring multi hour recovery
📌 Examples
1. Production incident: an incremental job set its watermark to current time, but the database replica lagged 15 seconds. Legitimate updates in that window (roughly 5000 rows per day) were permanently lost until detected 3 weeks later via a dashboard anomaly
2. Kafka consumer processed 1 million events, wrote to Postgres, and crashed before committing the offset. It reprocessed the same events, inflating aggregated revenue metrics by $500k until idempotent upsert logic was added using the transaction ID as the primary key