Common Failure Modes: Cost Explosions, Skew, and Concurrency Limits in Production
Cost Explosions
Serverless cost explosions happen when queries accidentally scan entire tables. At $5/TB, one full scan of a 100TB table costs $500. Forgetting the date filter, using a wildcard on the partition column, or joining without selective filters each triggers a full scan. Teams report monthly bills jumping from $5,000 to $50,000 after a single engineer deploys a dashboard with unbounded queries, because every refresh repeats the scan.
Mitigation: require partition filters on large tables, set per-user scan limits (for example, 10TB/day), and alert on queries that exceed a scan threshold.
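The cost arithmetic and the two guardrails above can be sketched as a pre-flight check. This is a minimal illustration, not any engine's API: the `QueryPlan` fields and the `check_query` helper are hypothetical, and the scan estimate is assumed to come from something like a dry run. The $5/TB price and the 10TB/day limit are the figures used in this section.

```python
from dataclasses import dataclass
from typing import List

PRICE_PER_TB_USD = 5.0       # on-demand price used in this section
DAILY_USER_LIMIT_TB = 10.0   # per-user daily scan limit used in this section


@dataclass
class QueryPlan:
    """Hypothetical summary of a query, e.g. as reported by a dry run."""
    user: str
    scan_tb: float
    has_partition_filter: bool


def scan_cost_usd(scan_tb: float) -> float:
    """Cost of scanning scan_tb terabytes at the flat per-TB price."""
    return scan_tb * PRICE_PER_TB_USD


def check_query(plan: QueryPlan, used_today_tb: float) -> List[str]:
    """Return policy violations; an empty list means the query may run."""
    violations = []
    if not plan.has_partition_filter:
        violations.append("missing partition filter")
    if used_today_tb + plan.scan_tb > DAILY_USER_LIMIT_TB:
        violations.append("daily scan limit exceeded")
    return violations


if __name__ == "__main__":
    full_scan = QueryPlan(user="analyst", scan_tb=100.0, has_partition_filter=False)
    print(scan_cost_usd(full_scan.scan_tb))
    print(check_query(full_scan, used_today_tb=0.0))
```

Rejecting the query before it runs matters here because an alert that fires afterwards only reports a cost that has already been incurred.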
Data Skew and Stragglers
Data skew creates stragglers (slow workers) that dominate query latency, because a query finishes only when its slowest task does. When joining on user_id, a celebrity account with 100 million events lands on one worker while the others process 10,000 each. That single worker spills to disk and takes 10 minutes while the others finish in 10 seconds. P99 latency spikes from 15 seconds to 10 minutes.
Symptoms: high shuffle bytes, spill-to-disk warnings, and outlier task durations. Solutions: salt hot keys (append a random suffix so the hot key spreads across workers, join against each suffix, then strip the suffix and deduplicate), filter before joining, or use approximate algorithms for heavy hitters.
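Key salting can be illustrated in plain Python. The sketch below is one common variant, assumed here for illustration: rows for hot keys on the large side get a random salt, and the matching row on the small side is replicated once per salt so every salted key still finds its partner. The names `salted_join`, `partition_sizes`, and `NUM_SALTS` are hypothetical, and a real engine would do this over distributed partitions rather than lists.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # assumed fan-out for hot keys


def salted_join(events, users, hot_keys, num_salts=NUM_SALTS, seed=0):
    """Join events (user_id, payload) with users (user_id, name) on salted keys."""
    rng = random.Random(seed)

    # Large side: hot keys get a random salt, all other keys use salt 0.
    salted_events = [
        ((uid, rng.randrange(num_salts) if uid in hot_keys else 0), payload)
        for uid, payload in events
    ]

    # Small side: replicate hot-key rows once per salt value.
    lookup = {}
    for uid, name in users:
        salts = range(num_salts) if uid in hot_keys else range(1)
        for salt in salts:
            lookup[(uid, salt)] = name

    # Join on the salted key, then drop the salt from the output.
    joined = [
        (key[0], payload, lookup[key])
        for key, payload in salted_events
        if key in lookup
    ]
    return joined, salted_events


def partition_sizes(salted_events):
    """Count rows per salted key, a stand-in for rows per worker."""
    counts = defaultdict(int)
    for key, _ in salted_events:
        counts[key] += 1
    return dict(counts)


if __name__ == "__main__":
    events = [("celebrity", i) for i in range(1000)] + [("u1", 0), ("u2", 1)]
    users = [("celebrity", "C"), ("u1", "A"), ("u2", "B")]
    joined, salted = salted_join(events, users, hot_keys={"celebrity"})
    print(len(joined), partition_sizes(salted))
```

The join output is the same as an unsalted join, but the hot key's rows are split across up to `NUM_SALTS` groups instead of landing in one.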
Concurrency Saturation
MPP clusters have a fixed number of query slots (e.g., 15). When 20 dashboards refresh simultaneously, 15 queries run and 5 queue. If refreshes keep arriving before the queue drains, waits accumulate: a query that queues for 25s before its 5s execution takes 30s end to end, up from 5s on an idle cluster. Serverless systems throttle via per-project slot quotas instead. Exceeding the quota causes unpredictable latency as queries wait for available slots.
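The queuing effect can be reproduced with a small simulation. This is a sketch under simplifying assumptions: first-come-first-served scheduling, a fixed 5s execution time for every query, and a hypothetical `simulate` helper; real schedulers add priorities, timeouts, and variable runtimes.

```python
import heapq


def simulate(arrivals, slots, exec_seconds):
    """Return per-query latency (queue wait + execution) under FIFO slots."""
    free_at = [0.0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    latencies = []
    for arrival in sorted(arrivals):
        slot_free = heapq.heappop(free_at)   # earliest available slot
        start = max(arrival, slot_free)      # wait if no slot is free yet
        finish = start + exec_seconds
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    return latencies


if __name__ == "__main__":
    # One burst: 20 dashboards refresh at t=0 on a 15-slot cluster.
    burst = simulate([0.0] * 20, slots=15, exec_seconds=5.0)
    # Sustained load: the same burst repeats every 5s for six rounds.
    sustained = simulate(
        [5.0 * r for r in range(6) for _ in range(20)], slots=15, exec_seconds=5.0
    )
    print(max(burst), max(sustained))
```

In a single burst the five queued queries wait one execution time. When bursts repeat before the queue drains, the backlog carries over and the worst-case wait keeps growing with each round.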
Small File Proliferation
Streaming ingestion that writes every minute creates 1,440 files per day. With 100 partitions, that is 144,000 files per day. Metadata operations slow down (listing takes seconds), scans lose locality (reading 1KB from each of 144K files is slower than reading 144MB from one file), and compression suffers. Query latency degrades gradually from seconds to minutes. Fix: batch writes to produce 128MB-1GB files, or run compaction during off-peak hours.
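The file-count arithmetic and a simple compaction plan can be sketched as follows. The greedy grouping is an assumption chosen for brevity, not a specific engine's compaction algorithm, and `files_per_day` and `plan_compaction` are hypothetical helper names. The 128MB target is the lower bound of the range given above.

```python
TARGET_BYTES = 128 * 1024 ** 2  # 128MB, lower bound of the suggested file size


def files_per_day(writes_per_hour, partitions):
    """Files produced per day when each write creates one file per partition."""
    return writes_per_hour * 24 * partitions


def plan_compaction(file_sizes_bytes, target_bytes=TARGET_BYTES):
    """Greedily group files, in order, into batches of at most target_bytes.

    A single file larger than target_bytes ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for size in file_sizes_bytes:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    count = files_per_day(writes_per_hour=60, partitions=100)
    batches = plan_compaction([1024] * count)  # every file assumed to be 1KB
    print(count, len(batches))
```

Each batch would be rewritten as one output file, so the number of batches is the file count after compaction.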