Data Pipelines & Orchestration › Pipeline Architecture Patterns · Hard · ⏱️ ~3 min

Pipeline Failure Modes and Edge Cases

Backpressure and Congestion: The most common production failure is cascading backpressure. Stage 3 normally processes 100k messages per second. A bug in newly deployed code drops its throughput to 5k per second. The queue between Stage 2 and Stage 3 starts filling at the net rate of 95k messages per second; if the queue is finite (say, 10 million messages), it is full in roughly 105 seconds, under two minutes. Once full, Stage 2 experiences write failures or throttling. This propagates upstream: Stage 2 slows, its input queue fills, and the congestion eventually reaches the ingestion layer. Without proper monitoring, this can go unnoticed until the entire pipeline is stalled. The fix: monitor queue depth and set alerts. When queue depth exceeds 50 percent of capacity, page the on-call engineer. Implement autoscaling: if the Stage 3 queue depth stays high for 5 minutes, automatically add more Stage 3 workers. Also implement circuit breakers: if a stage consistently fails to process messages, temporarily stop sending to it and route its traffic to a dead letter queue for manual inspection.
Cascading backpressure timeline: NORMAL at 100k/s → BUG DEPLOYED, throughput drops to 5k/s → QUEUE FULL roughly 105 seconds later.
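These rules are easy to express as a small control loop. The sketch below is illustrative only: get_queue_depth, page_oncall, and add_workers are hypothetical stand-ins for your metrics store, paging system, and autoscaler API, and the thresholds mirror the numbers used above.

```python
import time

# Thresholds from the text: alert at 50% of capacity,
# scale out after 5 minutes of sustained high depth.
QUEUE_CAPACITY = 10_000_000
ALERT_THRESHOLD = 0.5 * QUEUE_CAPACITY
SCALE_OUT_AFTER_SECONDS = 5 * 60

class BackpressureMonitor:
    """Watches one inter-stage queue and reacts to sustained backpressure."""

    def __init__(self, get_queue_depth, page_oncall, add_workers):
        # All three callables are assumed integrations (metrics store,
        # paging system, autoscaler API); they are placeholders here.
        self.get_queue_depth = get_queue_depth
        self.page_oncall = page_oncall
        self.add_workers = add_workers
        self.high_since = None  # when depth first crossed the threshold

    def poll(self):
        depth = self.get_queue_depth()
        if depth < ALERT_THRESHOLD:
            self.high_since = None
            return
        if self.high_since is None:
            # First crossing: record it and page the on-call engineer.
            self.high_since = time.monotonic()
            self.page_oncall(f"Queue depth {depth} exceeds 50% of capacity")
        elif time.monotonic() - self.high_since > SCALE_OUT_AFTER_SECONDS:
            # Sustained backlog: add workers to the slow stage.
            self.add_workers(stage="stage-3", count=2)
            self.high_since = time.monotonic()  # reset the scale-out timer
```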
Poison Messages: A malformed record enters the pipeline. Stage 2 tries to parse it, fails, and retries. It fails again, retries again. This message is now blocking progress for every message behind it in the partition. Without proper handling, that partition is effectively stuck. The solution is a retry policy with exponential backoff and a dead letter queue: retry up to 3 times with increasing delays (1 second, 10 seconds, 60 seconds), then move the message to a dead letter queue for manual inspection and continue processing subsequent messages (see the first sketch below). Track dead letter queue size as a metric: a sudden spike indicates a new class of malformed data.

Data Skew and Hotspots: You partition by user_id. Most users generate 10 events per day. One celebrity user generates 100k events per day. The worker handling that partition becomes overloaded, processing at 5x higher latency than the others. If downstream stages require data from all partitions (for example, a global aggregation), the slowest partition determines overall latency. Mitigation strategies: First, use secondary partitioning. Instead of just user_id, hash on user_id plus event_type, spreading one user's events across multiple partitions. Second, implement dynamic partition splitting: if a partition consistently exceeds latency targets, automatically split it into two partitions. Third, use consistent hashing with virtual nodes: each physical worker claims 100 to 200 positions on the hash ring, spreading hotspots.

Out of Order and Late Arrivals: In streaming pipelines, events often arrive out of order or late. Mobile clients lose connectivity, buffer events locally, reconnect 10 minutes later, and flush the buffer. Your pipeline then receives events with timestamps 10 minutes in the past. If you are computing hourly aggregates, naive logic that closes the 2pm to 3pm window at 3pm will miss these late arrivals. Watermarking solves this: define a lateness bound (for example, 5 minutes) and keep each window open until the current time exceeds the window end plus the lateness bound. Close the 2pm to 3pm window at 3:05pm. Events arriving after 3:05pm go to a late data side output for separate handling. Choose your lateness bound based on data: if 99 percent of events arrive within 2 minutes, a 5 minute bound captures 99.9 percent.
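A minimal sketch of the poison-message policy described above: 3 retries with 1s/10s/60s backoff, then dead-letter the message so the partition keeps moving. The handler and send_to_dead_letter_queue callables are hypothetical placeholders for whatever your pipeline framework provides.

```python
import time

RETRY_DELAYS_SECONDS = [1, 10, 60]  # backoff schedule from the text

def process_with_retries(message, handler, send_to_dead_letter_queue):
    """Attempt a message, retry 3 times with backoff, then dead-letter it.

    `handler` and `send_to_dead_letter_queue` are placeholder callables
    supplied by the surrounding pipeline framework.
    """
    last_error = None
    for delay in [0] + RETRY_DELAYS_SECONDS:   # initial attempt + 3 retries
        if delay:
            time.sleep(delay)                  # back off before retrying
        try:
            handler(message)
            return True                        # processed successfully
        except Exception as error:
            last_error = error
    # All attempts failed: park the poison message so the messages behind
    # it keep flowing, and record why it failed for manual inspection.
    send_to_dead_letter_queue(message, reason=str(last_error))
    return False
```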
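And a sketch of the watermark rule for late arrivals: keep each window open until its end plus the lateness bound, then route anything later to the side output. Production stream processors such as Flink or Beam provide this behavior natively; emit_aggregate and late_data_output below are hypothetical placeholders, and the window size and lateness bound match the hourly example above.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600          # hourly aggregates, as in the example
LATENESS_BOUND_SECONDS = 300   # 5 minute lateness bound

open_windows = defaultdict(list)   # window start -> buffered events
closed_windows = set()             # windows already emitted

def on_event(event_time, event, now, late_data_output):
    """Route one event into its hourly window, honoring the lateness bound.

    `late_data_output` is a placeholder callable; times are UNIX seconds.
    """
    window_start = event_time - (event_time % WINDOW_SECONDS)
    close_at = window_start + WINDOW_SECONDS + LATENESS_BOUND_SECONDS
    if window_start in closed_windows or now >= close_at:
        late_data_output(event)   # e.g. the 2pm-3pm window closed at 3:05pm
        return
    open_windows[window_start].append(event)

def advance_watermark(now, emit_aggregate):
    """Close every window whose end plus lateness bound has passed."""
    for window_start in list(open_windows):
        if now >= window_start + WINDOW_SECONDS + LATENESS_BOUND_SECONDS:
            emit_aggregate(window_start, open_windows.pop(window_start))
            closed_windows.add(window_start)
```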
⚠️ Common Pitfall: Replaying historical data through a pipeline is tricky. External side effects like sending emails or updating a search index must be idempotent. Use idempotency keys: include a unique event_id and check before performing side effects. Have I already processed event_id 12345? If yes, skip. If no, process and record that I did.
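A minimal sketch of that idempotency check, assuming a shared key-value store (redis-py here, but any shared store works) records which event_ids have already triggered the side effect; side_effect is a placeholder for the real action.

```python
import redis  # assumes the redis-py client; any shared key-value store works

r = redis.Redis(host="localhost", port=6379)
SEVEN_DAYS = 7 * 24 * 3600   # keep idempotency keys long enough to cover replays

def perform_side_effect_once(event_id, side_effect):
    """Run `side_effect` only if this event_id has not been processed before.

    `side_effect` stands in for the real action (send the email, update the
    search index). Note: check-then-set is not atomic; under concurrent
    replays, claim the key first with SET NX instead.
    """
    key = f"processed:{event_id}"
    if r.exists(key):               # have I already processed this event_id?
        return                      # yes: skip
    side_effect()                   # no: perform the side effect...
    r.set(key, 1, ex=SEVEN_DAYS)    # ...and record that I did
```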
Inconsistent Reference Data: Stage 2 enriches events with the user's country from a user profile database. Stage 4 filters events by country. A user changes country during the day. Some events are enriched with the old country (USA), others with the new country (Canada), so Stage 4 sees inconsistent data. The fix is to version the reference data: snapshot the user profile database at the start of each day, and have all stages use the same snapshot for that day's processing. This ensures consistency at the cost of slightly stale reference data. Alternatively, include a reference_data_version field in each event, allowing downstream stages to verify they are using matching reference data.
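A sketch of the snapshot approach, assuming a hypothetical load_user_profiles_snapshot(date) that returns the profile table frozen at the start of the day; tagging each event with reference_data_version is what lets downstream stages verify they joined against the same snapshot.

```python
from datetime import datetime, timezone

def load_user_profiles_snapshot(snapshot_date):
    """Placeholder: return {user_id: profile} as frozen at snapshot_date,
    for example a dated export read from object storage."""
    raise NotImplementedError

def enrich_events(events, snapshot_date=None):
    """Enrich all of a day's events from one snapshot so fields like
    country stay consistent across every stage for that day."""
    snapshot_date = snapshot_date or datetime.now(timezone.utc).date()
    profiles = load_user_profiles_snapshot(snapshot_date)
    version = snapshot_date.isoformat()            # e.g. "2024-05-01"
    for event in events:
        profile = profiles.get(event["user_id"], {})
        event["country"] = profile.get("country")
        event["reference_data_version"] = version  # downstream stages can
        yield event                                 # check versions match
```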
💡 Key Takeaways
Backpressure from a slow stage (100k/s dropping to 5k/s) fills a 10-million-message queue in under two minutes and cascades upstream; monitor queue depth and autoscale workers when depth exceeds 50 percent of capacity
Poison messages block a partition; retry up to 3 times with exponential backoff (1s, 10s, 60s), then move the message to a dead letter queue
Data skew from power users causes 5x latency spikes; use secondary partitioning or consistent hashing with 100 to 200 virtual nodes per worker
Late data arrivals require watermarking: keep aggregation windows open for lateness bound (for example 5 minutes) to capture 99.9 percent of events
📌 Examples
1. During a deployment, Stage 3 latency spikes from 50ms to 800ms. Queue depth grows from 100k to 5M messages in 20 minutes. The autoscaler adds 10 workers, draining the backlog in 8 minutes.
2. A mobile app reconnects after 15 minutes offline and sends 500 buffered events with old timestamps. Watermarking with a 5 minute lateness bound captures most of them, but 20 events arrive after the window closes and go to the late data output for reprocessing.