Feature Engineering & Feature Stores › Feature Transformation Pipelines (Spark, Flink) · Hard · ⏱️ ~3 min

Production Failure Modes: Backpressure, Skew, and State Explosion

Streaming pipelines fail in characteristic ways under production load.

Backpressure cascades occur when downstream sinks or operators cannot keep up with the input rate. Network buffers fill, watermarks stall, and latency spikes from milliseconds to seconds. A feature store write path that takes 200ms per batch under normal load might degrade to 2 seconds during traffic spikes, causing upstream queues to overflow and triggering backpressure that propagates all the way back to the Kafka brokers. The symptoms are rising P99 latency, uneven task utilization, and watermark lag growing without bound.

Data skew creates hotspots when a small subset of keys dominates traffic. A fraud detection pipeline partitioned by user ID might see 1% of users generating 50% of events (bots, super-active accounts). The hot partition's task slot runs at 100% CPU while others idle at 20%, becoming the bottleneck. Solutions include key salting (append a random suffix to spread hot keys across partitions), hierarchical aggregation (local pre-aggregation per partition, then a global combine), and load-aware partitioners that monitor per-key rates and rebalance dynamically.

State explosion happens with unbounded key cardinality, missing TTLs, or long windowed joins. A session window operator tracking all device IDs globally without a TTL accumulates billions of keys over days, pushing state from 100GB to 5TB. Checkpoint duration grows from 30 seconds to 10 minutes, recovery time from 2 minutes to 1 hour, and eventually tasks run out of memory. The fix is TTLs aligned to business requirements (if only 24-hour sessions matter, drop state after 48 hours), approximate structures (sketches instead of exact counts), and incremental checkpoints to limit snapshot overhead.

Mishandling late and out-of-order data causes feature undercounts or duplicates. If the watermark lag is 10 seconds but events arrive 60 seconds late during upstream retries, those events are dropped from their windows. A click-through rate (CTR) feature will then undercount clicks, biasing model training. The tradeoff is a looser watermark (say, a 2-minute lag) that captures more late data but delays outputs and increases buffering memory. Side outputs for late events allow reprocessing or reconciliation, as sketched below.
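A minimal Flink DataStream sketch (Java, 1.x-style API) of this late-data handling: a bounded-out-of-orderness watermark with a 2-minute lag, a small allowed lateness on the window, and a side output so anything later than that is reconciled rather than silently dropped. The ClickEvent, CtrWindow, and CtrAggregator names are illustrative placeholders, not APIs from any specific pipeline.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateClickHandling {

    /** Hypothetical input event: one ad impression or click. */
    public static class ClickEvent {
        public String adId;
        public long eventTimeMillis;
        public boolean isClick; // false = impression, true = click
    }

    /** Hypothetical per-window result; CTR = clicks / impressions downstream. */
    public static class CtrWindow {
        public long clicks;
        public long impressions;
    }

    /** Counts clicks and impressions inside each 5-minute window. */
    public static class CtrAggregator
            implements AggregateFunction<ClickEvent, CtrWindow, CtrWindow> {
        @Override public CtrWindow createAccumulator() { return new CtrWindow(); }
        @Override public CtrWindow add(ClickEvent e, CtrWindow acc) {
            if (e.isClick) acc.clicks++; else acc.impressions++;
            return acc;
        }
        @Override public CtrWindow getResult(CtrWindow acc) { return acc; }
        @Override public CtrWindow merge(CtrWindow a, CtrWindow b) {
            a.clicks += b.clicks; a.impressions += b.impressions; return a;
        }
    }

    // Events later than watermark + allowed lateness land here instead of being dropped.
    public static final OutputTag<ClickEvent> LATE_CLICKS =
            new OutputTag<ClickEvent>("late-clicks") {};

    public static DataStream<CtrWindow> buildCtrStream(DataStream<ClickEvent> clicks) {
        SingleOutputStreamOperator<CtrWindow> ctr = clicks
                // Looser watermark: tolerate events up to 2 minutes out of order.
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<ClickEvent>forBoundedOutOfOrderness(Duration.ofMinutes(2))
                                .withTimestampAssigner((event, ts) -> event.eventTimeMillis))
                .keyBy(event -> event.adId)
                .window(TumblingEventTimeWindows.of(Time.minutes(5)))
                // Keep window state a bit longer so slightly late events still update results.
                .allowedLateness(Time.minutes(1))
                // Route anything later than that to a side output for reconciliation.
                .sideOutputLateData(LATE_CLICKS)
                .aggregate(new CtrAggregator());

        // Late events can be written to a separate sink for batch backfill.
        ctr.getSideOutput(LATE_CLICKS).print("late-click");

        return ctr;
    }
}
```

In practice the side output stream would be written to a queue or table and merged back during batch backfill, so the CTR feature does not silently undercount.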
💡 Key Takeaways
Backpressure cascades when sinks degrade from 200ms to 2-second latency: network buffers fill, watermarks stall, and P99 latency spikes from 50ms to 3 seconds across the entire pipeline
Data skew with 1% of keys generating 50% of traffic creates hot partitions at 100% CPU while others idle. Key salting distributes hot keys across partitions by appending random suffixes
State explosion from unbounded keys without TTLs grows state from 100GB into the multi-terabyte range, increasing checkpoint duration from 30 seconds to 10 minutes and recovery time from 2 minutes to 1 hour (see the TTL sketch after this list)
Late data handling requires calibrating watermark lag to the observed P99 event lateness. A 10-second lag drops events arriving 60 seconds late, causing feature undercounts; a 2-minute lag captures more late data but delays outputs
Incremental checkpoints reduce overhead by only writing changed state, cutting checkpoint time from 5 minutes to under 1 minute for typical 500GB state workloads
Monitor key metrics: watermark lag (alert if exceeds 2x expected), backpressure ratio (alert above 80%), checkpoint duration (alert if exceeds interval), and per-partition CPU skew (alert if max/min ratio exceeds 3x)
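To make the state-hygiene takeaways concrete, here is a rough Flink sketch (Java, 1.x-style API) of TTL'd keyed state plus incremental RocksDB checkpointing. The 48-hour TTL mirrors the rule of thumb in this section; SessionFeatureFn and the per-device event count are invented for illustration.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class SessionStateHygiene {

    /** Keyed by device ID; keeps a running event count with a 48-hour TTL. */
    public static class SessionFeatureFn extends KeyedProcessFunction<String, Long, Long> {

        private transient ValueState<Long> eventCount;

        @Override
        public void open(Configuration parameters) {
            StateTtlConfig ttl = StateTtlConfig
                    // Only 24-hour sessions matter, so expire state after 48 hours of inactivity.
                    .newBuilder(Time.hours(48))
                    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                    .build();

            ValueStateDescriptor<Long> desc =
                    new ValueStateDescriptor<>("event-count", Long.class);
            desc.enableTimeToLive(ttl);
            eventCount = getRuntimeContext().getState(desc);
        }

        @Override
        public void processElement(Long value, Context ctx, Collector<Long> out) throws Exception {
            Long current = eventCount.value();
            long updated = (current == null ? 0L : current) + 1;
            eventCount.update(updated);
            out.collect(updated);
        }
    }

    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Incremental checkpoints: only changed RocksDB SST files are uploaded,
        // so checkpoint duration tracks the change rate rather than total state size.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
        env.enableCheckpointing(60_000L); // 60-second checkpoint interval
        return env;
    }
}
```

With OnCreateAndWrite the TTL clock resets on every update, so only genuinely idle keys expire, and NeverReturnExpired ensures reads never see stale state even before background cleanup removes it.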
📌 Examples
LinkedIn activity streaming: Experienced backpressure when feature store writes degraded during database failover. Network buffers filled to 95%, watermark lag jumped from 5 seconds to 2 minutes. Fixed by adding write buffering and circuit breakers to shed load during degradation.
Uber marketplace features: Hot driver IDs (drivers in high-demand areas) caused 10x CPU skew across partitions. Implemented two-level aggregation: a local 1-minute pre-aggregate per partition, then a global combine, reducing hot partition load by 80% (sketched below).
Netflix viewing pipeline: Session window state without a TTL grew to 3TB over 1 week, causing 15-minute checkpoints. Added a 7-day TTL aligned to the feature horizon, reducing state to 200GB, checkpoint time to 45 seconds, and recovery time from 30 minutes to 3 minutes.
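A rough sketch of that two-level (salted) aggregation pattern in Flink's Java API. The SALT_BUCKETS fan-out of 16, the 1-minute tumbling windows, and the tuple layout are assumptions for illustration, not Uber's actual implementation, and it presumes timestamps and watermarks are assigned upstream.

```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TwoLevelAggregation {

    private static final int SALT_BUCKETS = 16; // assumed fan-out for hot keys

    /** Input: (driverId, eventCount). Output: (driverId, per-minute total). */
    public static DataStream<Tuple2<String, Long>> countPerDriver(
            DataStream<Tuple2<String, Long>> events) {

        return events
                // Level 1: append a random salt so a single hot driver ID spreads
                // across SALT_BUCKETS parallel subtasks, then pre-aggregate locally.
                .map(e -> Tuple3.of(e.f0, ThreadLocalRandom.current().nextInt(SALT_BUCKETS), e.f1))
                .returns(Types.TUPLE(Types.STRING, Types.INT, Types.LONG))
                .keyBy(e -> e.f0 + "#" + e.f1)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .reduce((a, b) -> Tuple3.of(a.f0, a.f1, a.f2 + b.f2))
                // Level 2: drop the salt and combine the (at most SALT_BUCKETS)
                // partial sums per driver into the final per-minute total.
                .map(p -> Tuple2.of(p.f0, p.f2))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
    }
}
```

The first reduce bounds any single driver's load to one of SALT_BUCKETS subtasks, and the second reduce only ever merges at most SALT_BUCKETS partial sums per driver, so a hot key no longer serializes through a single task slot.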