Feature Transformation Pipelines (Spark, Flink)

Production Failure Modes: Backpressure, Skew, and State Explosion

Aggregation Consistency

Batch and streaming pipelines must produce identical aggregation results for the same input data. Subtle differences in window boundary handling, null treatment, or floating-point arithmetic cause training-serving skew. A batch job computing "average order value in the last 7 days" must match the streaming pipeline exactly: the same window boundaries aligned to midnight UTC, the same handling of null prices (excluded versus treated as zero), and the same rounding behavior.

Testing Strategies

Shadow mode runs the streaming pipeline alongside batch, comparing outputs for overlapping time periods. Expect an exact match for deterministic aggregations (count, sum) and a bounded difference for floating-point ones (average, standard deviation). Alert when more than 0.1 percent of values diverge or any value differs by more than 1 percent in magnitude. Major ML companies run continuous shadow comparison across hundreds of feature pipelines.
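The comparison logic is simple enough to sketch directly. This illustrative version (all names hypothetical) applies exact matching to deterministic features, a relative tolerance to floating-point ones, and the 0.1-percent divergence threshold for alerting:

```python
def compare_shadow(batch: dict[str, float], stream: dict[str, float],
                   exact: set[str], rel_tol: float = 0.01) -> list[str]:
    """Return names of features whose batch/stream values diverge."""
    diverged = []
    for name, b in batch.items():
        s = stream.get(name)
        if s is None:
            diverged.append(name)               # missing from streaming output
        elif name in exact:
            if s != b:                          # count/sum must match exactly
                diverged.append(name)
        elif abs(s - b) > rel_tol * abs(b):     # floats: bounded relative diff
            diverged.append(name)
    return diverged

def should_alert(diverged: int, total: int) -> bool:
    """Page someone if more than 0.1% of compared values diverged."""
    return diverged / total > 0.001
```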

Schema Evolution

Adding, removing, or changing feature columns requires coordinated updates across batch jobs, streaming jobs, online stores, and model serving. A new feature added to streaming but not batch creates training-serving skew. Feature stores enforce schema versioning: features have explicit version numbers, models declare required feature versions, and the serving layer validates compatibility at request time.
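The request-time compatibility check reduces to comparing the versions a model declares against the versions the store currently serves. A hedged sketch, with all type and function names invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureSchema:
    name: str
    version: int

def validate_request(model_requires: list[FeatureSchema],
                     store_serves: dict[str, int]) -> list[str]:
    """Reject serving when store feature versions don't satisfy the model."""
    errors = []
    for f in model_requires:
        served = store_serves.get(f.name)
        if served is None:
            errors.append(f"missing feature: {f.name}")
        elif served != f.version:
            errors.append(f"{f.name}: model wants v{f.version}, store serves v{served}")
    return errors
```

An empty error list means the request may proceed; anything else fails fast at the serving layer rather than silently feeding stale-schema values to the model.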

Failure Recovery

Batch jobs fail and retry at the job level. Streaming jobs fail and recover from checkpoints, replaying events since the last checkpoint. For exactly-once semantics, outputs must be idempotent: upserting to a key-value store with (entity, window) as the key ensures replayed events produce the same final state. Non-idempotent sinks (append-only logs) require deduplication downstream.
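The idempotence property is easy to demonstrate with a toy key-value sink (names hypothetical, store modeled as a plain dict): replaying the same events after a recovery overwrites rather than duplicates, so the final state is unchanged.

```python
def apply_events(store: dict, events: list[tuple[str, int, float]]) -> dict:
    """Idempotent sink: key by (entity, window); replays overwrite, not append."""
    for entity, window, value in events:
        store[(entity, window)] = value   # upsert: last write wins
    return store
```

An append-only sink processing the same replay would instead hold duplicate records, which is why such sinks need downstream deduplication.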

💡 Key Takeaways
Backpressure cascades: when a sink degrades from 200ms to 2 second latency, network buffers fill, watermarks stall, and P99 latency spikes from 50ms to 3 seconds across the entire pipeline
Data skew with 1% of keys generating 50% of traffic creates hot partitions at 100% CPU while others idle. Key salting distributes hot keys across partitions by appending random suffixes
State explosion from unbounded keys without TTLs grows state from 100GB to multi-terabyte, increasing checkpoint duration from 30 seconds to 10 minutes and recovery time from 2 minutes to 1 hour
Late data handling requires calibrating watermark lag to P99 inter-arrival times: a 10 second lag drops events arriving 60 seconds late, causing feature undercounts, while a 2 minute lag captures more late events but delays outputs
Incremental checkpoints reduce overhead by only writing changed state, cutting checkpoint time from 5 minutes to under 1 minute for typical 500GB state workloads
Monitor key metrics: watermark lag (alert if exceeds 2x expected), backpressure ratio (alert above 80%), checkpoint duration (alert if exceeds interval), and per-partition CPU skew (alert if max/min ratio exceeds 3x)
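The key-salting takeaway above can be sketched end to end: hot keys are fanned out across random sub-keys so no single partition owns all their traffic, then a second stage merges the sub-keys back into true totals. All names and the bucket count are illustrative assumptions:

```python
import random
from collections import defaultdict

SALT_BUCKETS = 8  # hypothetical fan-out for hot keys

def salted_key(key: str, hot_keys: set[str]) -> str:
    """Spread a hot key across SALT_BUCKETS sub-keys; leave cold keys alone."""
    if key in hot_keys:
        return f"{key}#{random.randrange(SALT_BUCKETS)}"
    return key

def unsalt(key: str) -> str:
    return key.split("#", 1)[0]

def partial_counts(events: list[str], hot: set[str]) -> dict[str, int]:
    """Stage 1: count per salted key (parallelizable across partitions)."""
    counts: dict[str, int] = defaultdict(int)
    for k in events:
        counts[salted_key(k, hot)] += 1
    return dict(counts)

def merge(partials: dict[str, int]) -> dict[str, int]:
    """Stage 2: fold salted sub-keys back into totals per original key."""
    totals: dict[str, int] = defaultdict(int)
    for k, c in partials.items():
        totals[unsalt(k)] += c
    return dict(totals)
```

The trade-off is the extra merge stage: salting only pays off for aggregations that are associative, so partial results can be combined.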
📌 Interview Tips
1. LinkedIn activity streaming: Experienced backpressure when feature store writes degraded during database failover. Network buffers filled to 95%, watermark lag jumped from 5 seconds to 2 minutes. Fixed by adding write buffering and circuit breakers to shed load during degradation.
2. Uber marketplace features: Hot driver IDs (drivers in high demand areas) caused 10x CPU skew across partitions. Implemented two-level aggregation: local 1 minute pre-aggregate per partition, then global combine, reducing hot partition load by 80%.
3. Netflix viewing pipeline: Session window state without TTL grew to 3TB over 1 week, causing 15 minute checkpoints. Added 7 day TTL aligned to feature horizon, reduced state to 200GB, checkpoint time to 45 seconds, and recovery time from 30 minutes to 3 minutes.
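The two-level aggregation pattern from the Uber anecdote can be sketched as a pair of stages: each partition folds its events into one (sum, count) record per key, and a global combine merges those partials into final averages. Function names and the average-of-values example are illustrative assumptions, not Uber's actual code:

```python
from collections import defaultdict

def local_preaggregate(partition_events: list[tuple[str, float]]
                       ) -> dict[str, tuple[float, int]]:
    """Stage 1: per-partition (sum, count), so a hot key ships one record."""
    acc: dict[str, tuple[float, int]] = defaultdict(lambda: (0.0, 0))
    for key, value in partition_events:
        s, n = acc[key]
        acc[key] = (s + value, n + 1)
    return dict(acc)

def global_combine(partials: list[dict[str, tuple[float, int]]]
                   ) -> dict[str, float]:
    """Stage 2: merge partials from all partitions into per-key averages."""
    sums: dict[str, tuple[float, int]] = defaultdict(lambda: (0.0, 0))
    for part in partials:
        for key, (s, n) in part.items():
            total_s, total_n = sums[key]
            sums[key] = (total_s + s, total_n + n)
    return {k: s / n for k, (s, n) in sums.items()}
```

Unlike key salting, no randomness is involved: the hot key's load is absorbed by pre-aggregation, and only one compact partial per partition reaches the global stage.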