Failure Modes and Operational Challenges
Batch Backlog: The Most Common Production Issue:
The most visible failure mode in micro-batch systems is batch backlog, which occurs when processing time exceeds the batch interval. With a 5 second interval, if each batch starts taking 8 to 10 seconds because of a traffic spike or a slow downstream system, batches queue up rapidly.
The math is unforgiving. If processing time is 8 seconds and interval is 5 seconds, you fall behind by 3 seconds per batch. After 10 batches, you're 30 seconds behind. After 100 batches (under 10 minutes of clock time), you're 5 minutes behind. Your p99 latency degrades from under 20 seconds to minutes, violating SLOs and potentially causing cascading failures as monitoring systems trigger alerts.
Mitigation requires aggressive autoscaling that monitors the ratio of processing time to batch interval. When this ratio exceeds 60 to 70 percent for three consecutive batches, the system should add workers proactively. However, autoscaling has its own latency: cloud instances take 60 to 180 seconds to provision, during which the backlog continues to grow.
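A minimal sketch of that trigger in Python is shown below; the 70 percent threshold, the three-batch window, and the request_scale_up hook are illustrative assumptions rather than any particular framework's API.

```python
# Sketch of a backlog-aware scaling trigger (assumed names and thresholds).
from collections import deque

BATCH_INTERVAL_S = 5.0        # configured micro-batch interval
RATIO_THRESHOLD = 0.7         # scale when processing uses >70% of the interval
CONSECUTIVE_BATCHES = 3       # require sustained pressure, not one slow batch

recent_ratios = deque(maxlen=CONSECUTIVE_BATCHES)

def request_scale_up() -> None:
    # Placeholder: call your cluster manager or autoscaler here.
    print("scale-up requested")

def on_batch_complete(processing_time_s: float) -> None:
    """Record the latest batch duration and scale up when near saturation."""
    recent_ratios.append(processing_time_s / BATCH_INTERVAL_S)
    if (len(recent_ratios) == CONSECUTIVE_BATCHES
            and all(r > RATIO_THRESHOLD for r in recent_ratios)):
        request_scale_up()
        recent_ratios.clear()  # don't re-trigger every batch while new workers boot

# Example: three slow batches in a row trigger one scale-up request.
for duration in (3.0, 4.0, 4.1, 4.2, 4.5):
    on_batch_complete(duration)
```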
Checkpointing and Partial Failures:
A subtler problem occurs when a job crashes after writing partial outputs for a micro-batch but before committing its checkpoint. The offset log shows the batch covering offsets 1000 to 2000 was started, half the aggregates were written to the analytics store, and then the job died before the commit was recorded.
A naive design either loses data (if you advance the checkpoint anyway) or double-processes it (if you replay the batch and the sink isn't idempotent). The standard solution is to make the sink idempotent using batch identifiers: each write includes both the data and a compound key of batch_id plus source offset_range. If you replay the batch, the sink recognizes the duplicate writes and ignores them.
Alternatively, use transactional sinks where each batch is an atomic database transaction or file commit. Delta Lake, Iceberg, and Hudi all support this pattern, making each micro-batch an ACID (Atomicity, Consistency, Isolation, Durability) transaction at the table level.
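A minimal sketch of an idempotent sink follows, using SQLite's INSERT OR IGNORE so it runs anywhere; with PostgreSQL the equivalent is INSERT ... ON CONFLICT DO NOTHING. The table name, columns, and the extra agg_key component of the primary key are illustrative assumptions.

```python
# Idempotent micro-batch sink sketch; SQLite stands in for the real store.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE aggregates (
        batch_id     TEXT NOT NULL,
        offset_range TEXT NOT NULL,
        agg_key      TEXT NOT NULL,
        agg_value    INTEGER NOT NULL,
        PRIMARY KEY (batch_id, offset_range, agg_key)
    )
""")

def write_batch(batch_id: str, offset_range: str, rows: dict) -> None:
    """Write one micro-batch's aggregates; replaying the same batch is a no-op."""
    with conn:  # one transaction per micro-batch
        conn.executemany(
            "INSERT OR IGNORE INTO aggregates VALUES (?, ?, ?, ?)",
            [(batch_id, offset_range, k, v) for k, v in rows.items()],
        )

# A crash after a partial write is repaired by simply replaying the batch:
write_batch("batch-42", "1000-2000", {"video_a": 120, "video_b": 7})
write_batch("batch-42", "1000-2000", {"video_a": 120, "video_b": 7})  # replay
print(conn.execute("SELECT COUNT(*) FROM aggregates").fetchone()[0])  # -> 2, not 4
```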
Data Skew at Micro-batch Scale:
Suppose 90 percent of events in a given 5 second interval belong to one hot key: a viral video generating millions of views, or a single large enterprise customer dominating traffic. If you partition by video identifier or customer identifier, one task processes 900,000 events while the other tasks process around 10,000 each.
That task becomes a straggler, taking 10x longer than others. The entire micro-batch completes only when the slowest task finishes, so p99 latency spikes from 5 seconds to 50 seconds for that batch. This causes downstream effects: the next batch can't start, backlog accumulates, and monitoring alerts fire.
⚠️ Common Pitfall: Interviewers often ask how you handle hot keys. The answer involves salting: append a random suffix (0 to 9) to hot keys, spreading them across 10 partitions, then combining results in a downstream aggregation step.
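A sketch of the two-stage pattern in plain Python appears below; the known hot-key set, the 10 salt buckets, and the synthetic event list are illustrative assumptions, and in Spark or Flink the same idea becomes two consecutive group-by stages.

```python
# Two-stage aggregation with key salting (assumed hot-key set and event data).
import random
from collections import Counter

SALT_BUCKETS = 10
HOT_KEYS = {"viral_video_123"}   # assumed: hot keys are known or detected upstream

events = ["viral_video_123"] * 900_000 + ["video_b"] * 10_000 + ["video_c"] * 10_000

def salted_key(key: str) -> str:
    """Spread hot keys across SALT_BUCKETS partitions; leave cold keys untouched."""
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(SALT_BUCKETS)}"
    return key

# Stage 1: partial counts by salted key (each salt bucket can land on a different task).
partial = Counter(salted_key(k) for k in events)

# Stage 2: strip the salt and merge partial counts back onto the original key.
final = Counter()
for key, count in partial.items():
    final[key.split("#", 1)[0]] += count

print(final["viral_video_123"])  # 900000, now computed by ~10 tasks instead of 1
```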
Late and Out-of-Order Data:
With micro-batch windowing, you must decide how long to wait for late events. Watermarks typically lag the maximum observed event time by several minutes. If you set the watermark at "maximum event time minus 5 minutes," any event arriving more than 5 minutes late is dropped or handled separately.
The trade-off is sharp. Aggressive watermarks (short lag) reduce latency and state size but increase the rate of late events that require correction. Conservative watermarks (long lag) capture more late events but increase memory usage and end-to-end latency. For a 5 second micro-batch with a 5 minute watermark, you're holding 60 batches worth of state for each window.
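If the pipeline runs on Spark Structured Streaming (the archetypal micro-batch engine), the watermark is a single call; the sketch below uses the built-in rate source and a synthetic video_id so it runs without Kafka, both of which are illustrative assumptions.

```python
# Watermarked windowed count, assuming Spark Structured Streaming (PySpark).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

# The built-in 'rate' source emits (timestamp, value) rows; value % 50 stands in
# for a video_id so the example needs no Kafka cluster.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 100)
          .load())

counts = (events
          .withWatermark("timestamp", "5 minutes")   # events >5 min late are dropped
          .groupBy(window(col("timestamp"), "1 minute"),
                   (col("value") % 50).alias("video_id"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .trigger(processingTime="5 seconds")        # 5 second micro-batches
         .format("console")
         .start())

query.awaitTermination(60)  # let the demo run briefly, then shut down
query.stop()
```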
Cascading Failures and Recovery:
Cluster level failures create their own challenges. If you lose multiple workers or experience a region outage, you may need to replay dozens or hundreds of micro-batches from old checkpoints. This spikes compute usage and generates massive load on downstream sinks.
Production systems design for controlled catch-up by limiting concurrent batch replays. Instead of replaying 100 batches in parallel, replay 5 at a time. Use admission control on sinks: a database that normally handles 10,000 writes per second shouldn't suddenly face 100,000 writes per second from catch-up traffic. Rate limiting and back pressure prevent recovery from triggering secondary failures.
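The sketch below illustrates the idea in plain Python: at most five batches replay concurrently, and a shared token bucket caps sink writes per second. MAX_CONCURRENT_REPLAYS, SINK_WRITES_PER_SEC, and replay_batch are illustrative assumptions, not a specific framework's API.

```python
# Controlled catch-up sketch: bounded replay concurrency plus a shared rate limit.
import time
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_REPLAYS = 5        # replay 5 batches at a time, not 100
SINK_WRITES_PER_SEC = 10_000      # what the downstream database can absorb

class RateLimiter:
    """Crude token bucket shared by all replay workers (admission control)."""
    def __init__(self, rate_per_sec: int):
        self.rate = rate_per_sec
        self.tokens = float(rate_per_sec)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self, n: int) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= n:
                    self.tokens -= n
                    return
            time.sleep(n / self.rate)   # wait roughly long enough for tokens to refill

limiter = RateLimiter(SINK_WRITES_PER_SEC)

def replay_batch(batch_id: int) -> None:
    writes = 2_000                  # pretend each replayed batch produces 2,000 sink writes
    limiter.acquire(writes)         # throttle before touching the sink
    time.sleep(0.01)                # stand-in for the actual reprocessing work

# Replay 100 backed-up batches with at most 5 in flight; the rate limit stretches
# this demo to roughly 20 seconds instead of hammering the sink all at once.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_REPLAYS) as pool:
    list(pool.map(replay_batch, range(100)))
```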
💡 Key Takeaways
✓ Batch backlog occurs when processing time exceeds the interval: with 5 second intervals and 8 second processing, you fall behind 3 seconds per batch, degrading p99 latency from under 20 seconds to minutes within about 10 minutes
✓ Checkpointing failures require idempotent sinks that use compound keys like batch_id plus offset_range, or transactional sinks where each micro-batch is an atomic commit
✓ Data skew from hot keys causes stragglers that delay entire batches: mitigate by salting hot keys with random suffixes to spread load across multiple partitions
✓ Watermark configuration trades off late-event capture versus latency: a 5 minute watermark with 5 second batches means holding 60 batches of state per window
✓ Cluster failures requiring replay of 100+ batches need controlled catch-up with rate limiting: replay 5 batches concurrently instead of 100 to prevent overwhelming downstream sinks
📌 Examples
1. A video streaming analytics pipeline experiences hot key skew when a viral video generates 5 million views in one micro-batch. The system salts the video_id with suffixes 0 through 9, distributing load across 10 tasks and reducing straggler time from 45 seconds to 6 seconds.
2. An e-commerce events pipeline uses batch_id (a UUID) plus offset_range as a compound key when writing to PostgreSQL. When a worker crashes after partial writes, the replay re-inserts rows with the same keys; the writer uses INSERT ... ON CONFLICT DO NOTHING so PostgreSQL discards the duplicates, achieving idempotence.
3. A monitoring system sets watermarks at event_time minus 2 minutes for 5 second micro-batches. During a network partition, 3 percent of events arrive 5+ minutes late, requiring a separate late-events correction pipeline that updates historical aggregates once per hour.