Stream Processing Architectures: Event Streaming Fundamentals

Failure Modes: Backpressure, Lag, and Late Events

Backpressure and Consumer Lag: Backpressure occurs when a consumer cannot keep up with the producer write rate. Events accumulate in the broker, increasing lag (the difference between the latest offset and the consumer's current offset). This manifests as increased latency and, if lag grows beyond the retention window, data loss. Imagine a system where producers write 100,000 events per second but a consumer can only process 80,000 per second due to expensive downstream writes. Lag grows by 20,000 events per second. Catching up requires a consumption surplus: even after scaling so the consumer group drains 20,000 events per second faster than producers write, a backlog of 10 million events takes 500 seconds (over 8 minutes) to clear, violating Service Level Objectives (SLOs) for real-time dashboards. Production systems monitor lag in both time (seconds behind) and offset count, and alert when lag exceeds thresholds, for example over 30 seconds for operational dashboards or over 1 million events for analytics pipelines. Recovery strategies include autoscaling consumers, temporarily shedding non-critical work, or increasing the partition count and rebalancing.

Time-Related Edge Cases: Events arrive late or out of order due to network delays, client device clock drift, or retries. A classic failure mode is using processing-time semantics for business logic: if you aggregate by arrival time, late events land in the wrong window and produce incorrect business metrics. Event-time semantics with watermarks fix this, but introduce a new trade-off. Watermarks estimate event-time progress: wait too long for late events and results are delayed; do not wait long enough and you drop late events or must issue corrections. A production example: you configure a 5-minute allowed-lateness window for hourly aggregations, so events arriving more than 5 minutes late are dropped or sent to a side output. During a network partition that lasts 10 minutes, events from that period are systematically late. Your system either drops them (incorrect counts) or requires reprocessing with a longer lateness window (delayed results).
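The lag arithmetic above can be sketched as two small helper functions (illustrative names, not from any streaming framework): lag grows by the difference between produce and consume rates, and catch-up time depends on the consumption surplus.

```python
def lag_growth_per_sec(produce_rate: int, consume_rate: int) -> int:
    """Events of lag accumulated per second (negative means catching up)."""
    return produce_rate - consume_rate


def catch_up_seconds(current_lag: int, produce_rate: int, consume_rate: int) -> float:
    """Seconds until lag reaches zero, assuming both rates stay constant.

    Returns infinity when the consumer never catches up.
    """
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return float("inf")
    return current_lag / surplus


# Producers write 100k/s, the consumer handles 80k/s: lag grows by 20k/s.
assert lag_growth_per_sec(100_000, 80_000) == 20_000

# After scaling the consumer group to 120k/s, a 10-million-event backlog
# drains at a 20k/s surplus: 500 seconds, over 8 minutes.
assert catch_up_seconds(10_000_000, 100_000, 120_000) == 500.0
```

Note that without a surplus the backlog never drains, which is why alerting on lag growth rate, not just absolute lag, matters.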
Backpressure Cascade Timeline: NORMAL (0 sec lag) → SLOW SINK (30 sec lag) → QUEUE FULL (5 min lag)
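The allowed-lateness trade-off described above can be sketched in a few lines. This is a minimal toy, not any particular framework's API: events whose timestamps fall further behind the watermark than the allowed lateness are routed to a side output instead of their window.

```python
from collections import defaultdict

WINDOW = 3600           # hourly windows, in seconds
ALLOWED_LATENESS = 300  # 5 minutes

windows = defaultdict(int)  # window start -> event count
side_output = []            # too-late events, kept for later correction
watermark = 0               # estimated event-time progress


def process(event_time: int, observed_max: int) -> None:
    """Assign an event to its hourly window, or divert it if too late."""
    global watermark
    watermark = max(watermark, observed_max)
    if watermark - event_time > ALLOWED_LATENESS:
        # The window this event belongs to has already closed.
        side_output.append(event_time)
    else:
        windows[event_time // WINDOW * WINDOW] += 1


# An on-time event, then one 10 minutes behind the watermark (as during
# the 10-minute network partition described above):
process(event_time=7200, observed_max=7200)
process(event_time=6600, observed_max=7200)  # 600s late > 300s allowed
assert windows[7200] == 1
assert side_output == [6600]
```

Raising `ALLOWED_LATENESS` rescues the partition's events but delays every window's final result by the same margin.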
Exactly Once Semantics Limitations: Many streaming systems claim exactly once processing, meaning each event is processed exactly once even across failures. This is achieved via transactional coordination: atomically committing consumer offsets and sink writes. The limitation is that exactly once only applies within the streaming system boundaries. If you write to an external sink (database, cache, API) that is not transactional or idempotent, retries after failures can still produce duplicates. For example, a consumer writes an event to a database, crashes before committing its offset, restarts, and writes the same event again. The database now has a duplicate. The solution is idempotent consumers: design sinks to accept duplicates safely, using upsert operations or unique keys. Alternatively, use transactional sinks that coordinate with the streaming system, but this adds latency and complexity.

Schema Evolution Failures: Event schemas evolve over time. Adding a new field to an event payload seems safe, but old consumers that expect a fixed schema can break. A consumer deserializes an event, encounters an unknown field, and throws an exception, halting processing. Production systems use schema registries with compatibility rules. Forward compatibility means old consumers can read new schemas. Backward compatibility means new consumers can read old schemas. Full compatibility requires both. Enforcing these rules at schema registration time prevents runtime failures. But in practice, bugs slip through. A producer deploys a breaking change, and a downstream consumer crashes. Recovery involves rolling back the producer, fixing the consumer to handle the new schema, or both. During recovery, event lag grows, potentially violating SLOs.
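An idempotent sink of the kind described above can be sketched with an upsert keyed on the event ID. This is a minimal illustration using SQLite; the table and column names are hypothetical, not from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, amount INTEGER)")


def write_event(event_id: str, amount: int) -> None:
    # INSERT OR REPLACE makes redelivery safe: a retried write overwrites
    # the existing row instead of adding a duplicate.
    conn.execute(
        "INSERT OR REPLACE INTO events (event_id, amount) VALUES (?, ?)",
        (event_id, amount),
    )
    conn.commit()


# Simulate a crash before the offset commit: the same event is delivered twice.
write_event("evt-42", 100)
write_event("evt-42", 100)

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
assert count == 1  # still one row, despite duplicate delivery
```

The unique key turns at-least-once delivery into effectively-once storage, which is the practical meaning of "idempotent consumer" here.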
❗ Remember: Exactly once processing within the stream processor does not guarantee exactly once end-to-end delivery. External sinks must be idempotent or transactional to avoid duplicates on retry.
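The schema-compatibility rules discussed earlier can be illustrated with a toy backward-compatibility check (not a real schema-registry API): a new schema is backward compatible if every field it adds is either optional or carries a default, so consumers on the new schema can still read old events.

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Each field spec is a dict like {"required": bool, "default": ...}."""
    for name, spec in new_fields.items():
        if name in old_fields:
            continue
        # A brand-new required field with no default breaks reads of old
        # events, which simply do not contain it.
        if spec.get("required") and "default" not in spec:
            return False
    return True


old = {"user_id": {"required": True}}
safe = {"user_id": {"required": True},
        "region": {"required": True, "default": "unknown"}}
breaking = {"user_id": {"required": True},
            "session": {"required": True}}  # no default: old events lack it

assert is_backward_compatible(old, safe) is True
assert is_backward_compatible(old, breaking) is False
```

Real registries apply checks like this at registration time, rejecting the breaking schema before a producer can deploy it.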
💡 Key Takeaways
Backpressure occurs when consumers cannot keep up with producer rate; lag grows by the throughput difference (100k writes minus 80k reads equals 20k events per second lag growth)
Monitoring lag in seconds and offset count is critical; typical SLOs alert at 30 seconds lag for real-time systems and 5 minutes for analytics
Event time with watermarks handles late arrivals but requires tuning allowed lateness: too short drops valid events, too long delays results
Exactly once processing in the stream does not guarantee exactly once end-to-end; external sinks must be idempotent or transactional to prevent duplicates on retry
Schema evolution failures occur when producers deploy breaking changes; schema registries with forward and backward compatibility rules prevent runtime crashes
📌 Examples
1. Consumer processing 80,000 events per second while producers write 100,000 per second accumulates 72 million events of lag per hour
2. Network partition lasting 10 minutes causes events to arrive 10 to 15 minutes late; system configured with 5 minute lateness window drops them
3. Stream processor claims exactly once but writes to non idempotent HTTP API; crash and retry causes duplicate POST requests and double charges
4. Producer adds required field to event schema without backward compatibility check; old consumers crash on deserialization, halting processing