Common Failure Modes: Cost Explosions, Skew, and Concurrency Limits in Production
Serverless cost explosions happen when queries accidentally scan entire tables. A BigQuery table holding 2 years of events at 100 TB total costs $500 to scan completely at $5 per TB. Forgetting the date filter in a WHERE clause, using a wildcard on the partition column, or joining without proper filters triggers a full scan. Teams report monthly bills jumping from $5,000 to $50,000 after a single engineer deploys a dashboard with unbounded queries. The mitigation is enforcing partition filter requirements, setting per-user scan-byte limits (say, 10 TB per day), and alerting on queries that exceed thresholds.
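A minimal sketch of these guardrails using the google-cloud-bigquery Python client: a dry run estimates bytes scanned before spending anything, and maximum_bytes_billed makes the real run fail rather than bill past a cap. The table name, date range, and 10 TB cap are illustrative assumptions; BigQuery also offers a table-level require_partition_filter option to enforce filters at the source.

```python
# Sketch: pre-flight cost check plus a hard byte cap per query.
from google.cloud import bigquery

client = bigquery.Client()
MAX_BYTES = 10 * 1024**4  # illustrative 10 TiB cap per query

sql = """
    SELECT user_id, COUNT(*) AS events
    FROM `project.dataset.events`              -- hypothetical table
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-07'  -- partition filter
    GROUP BY user_id
"""

# Dry run: estimates bytes scanned without running (or billing) the query.
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
estimate = client.query(sql, job_config=dry_cfg)
print(f"Would scan {estimate.total_bytes_processed / 1e12:.2f} TB")

# Real run with a cap: the job errors out instead of billing past MAX_BYTES.
run_cfg = bigquery.QueryJobConfig(maximum_bytes_billed=MAX_BYTES)
rows = client.query(sql, job_config=run_cfg).result()
```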
Data skew creates stragglers that dominate query latency. When joining on user_id, a celebrity account with 100 million events lands on one worker while the others process 10 thousand events each. That single worker spills to disk and takes 10 minutes while the others finish in 10 seconds, so p99 latency spikes from 15 seconds to 10 minutes. Symptoms include high shuffle bytes, spill-to-disk warnings, and outlier task durations in query plans. Solutions involve salting hot keys (tag the skewed side with a random salt suffix, replicate the other side across salt values, join on the composite key, then drop the salt), filtering before joining, or switching to approximate algorithms for heavy hitters.
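A minimal PySpark sketch of the salting pattern, assuming an events table skewed on user_id and a smaller users table; all table and column names are illustrative. For simplicity the entire small side is replicated across salt values; a targeted variant salts only the known hot keys.

```python
# Sketch: spread a hot join key across SALT_BUCKETS workers.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()
SALT_BUCKETS = 10

events = spark.table("events")   # large, skewed side (hypothetical table)
users = spark.table("users")     # small side to replicate (hypothetical table)

# Large side: assign each row a random salt in [0, SALT_BUCKETS).
events_salted = events.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("long")
)

# Small side: replicate each row once per salt value so every
# (user_id, salt) combination on the large side finds a match.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
users_salted = users.crossJoin(salts)

# Join on the composite key: the celebrity user's rows now spread
# across SALT_BUCKETS workers instead of landing on one.
joined = events_salted.join(users_salted, ["user_id", "salt"]).drop("salt")
```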
Concurrency saturation manifests differently per model. MPP clusters have fixed query slots (a Redshift cluster might allocate 15 slots). When 20 BI dashboards refresh simultaneously, 5 queries queue. During peak load, average query time jumps from 5 seconds to 30 seconds (25 seconds queuing plus 5 seconds execution). Serverless systems throttle via slot quotas per project: exceeding the quota forces queries to wait for available slots, adding unpredictable latency. A BigQuery project with a 2,000-slot quota comfortably handling 50 concurrent queries degrades when traffic spikes to 100, as twice as many queries contend for the same fixed slot pool.
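A back-of-the-envelope sketch of the fixed-slot queueing arithmetic, assuming equal-length queries served in FIFO waves; the slot count and durations are the illustrative numbers above, and real workload managers schedule more cleverly than this.

```python
# Sketch: average latency when N equal-length queries share a fixed slot pool.
import math

def avg_latency(concurrent: int, slots: int, exec_seconds: float) -> float:
    """Queries run in FIFO waves of at most `slots` at a time."""
    waves = math.ceil(concurrent / slots)
    total = 0.0
    for wave in range(waves):
        in_wave = min(slots, concurrent - wave * slots)
        wait = wave * exec_seconds          # full rounds queued before starting
        total += in_wave * (wait + exec_seconds)
    return total / concurrent

print(avg_latency(20, 15, 5.0))    # 20 dashboards on 15 slots: 6.25 s average
print(avg_latency(100, 15, 5.0))   # peak load: queuing dominates (~19 s average)
```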
Small file proliferation is insidious. Streaming ingestion that writes every minute creates 1,440 files per day; with 100 partitions, that is 144 thousand files. Metadata operations slow down (listing partitions takes seconds), scans lose locality (reading 1 KB from each of 144K files is far slower than reading 144 MB from one file), and compression suffers (small files do not compress as well). Query latency degrades gradually from seconds to minutes. The fix is compaction during off-peak hours to consolidate into larger files, or batching writes at the source to produce fewer, larger files (buffer 5 minutes of events and flush 5 MB files instead of 1 KB per event).
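A minimal sketch of source-side batching, assuming a sink callable that writes one object per flush; the 5 MB and 5 minute thresholds mirror the example above and are otherwise arbitrary.

```python
# Sketch: buffer records and flush by size or age instead of per event.
import time

FLUSH_BYTES = 5 * 1024 * 1024   # flush at ~5 MB
FLUSH_SECONDS = 300             # or every 5 minutes, whichever comes first

class BufferedWriter:
    def __init__(self, sink):
        self.sink = sink                     # callable that writes one file/object
        self.buffer = bytearray()
        self.last_flush = time.monotonic()

    def write(self, record: bytes) -> None:
        self.buffer.extend(record)
        too_big = len(self.buffer) >= FLUSH_BYTES
        too_old = time.monotonic() - self.last_flush >= FLUSH_SECONDS
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.sink(bytes(self.buffer))    # one larger file instead of many tiny ones
            self.buffer.clear()
        self.last_flush = time.monotonic()
```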
💡 Key Takeaways
• Cost explosions from missing partition filters. A query on a 100 TB table without a date filter scans everything for $500, versus $25 for a 5 TB weekly window. Enforce partition filter requirements and set per-user scan-byte limits.
• Data skew creates stragglers that dominate latency. A celebrity user with 100 million events lands on one worker that takes 10 minutes while the others finish in 10 seconds, spiking p99 latency from 15 seconds to 10 minutes.
• Concurrency limits differ by model. MPP clusters queue when concurrent queries exceed fixed slots (15 slots, 20 queries means 5 wait). Serverless throttles via project slot quotas, degrading latency unpredictably under load spikes.
• Small file proliferation degrades scans. Streaming every minute creates 144 thousand files per day across 100 partitions. Metadata overhead and poor locality slow queries from seconds to minutes. Batch writes to produce 128 MB to 1 GB files.
• Shuffle and spill-to-disk signals. High shuffle bytes (tens to hundreds of GB) and spill warnings indicate a poor join strategy or skew. Broadcast small tables under a few hundred MB or salt hot keys to distribute load (see the sketch after this list).
• Monitoring bytes scanned is critical. Track bytes scanned per query and per user, set alerts on thresholds (like 10 TB per query), and review the top offenders weekly to catch runaway queries before they drain budgets.
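A minimal PySpark sketch of the broadcast hint mentioned above, assuming a large events table and a small dim_users dimension; the table names are illustrative. Broadcasting ships the small table to every worker so the large side never shuffles.

```python
# Sketch: broadcast join for a small dimension table (hypothetical names).
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

events = spark.table("events")        # large fact table
dim_users = spark.table("dim_users")  # small dimension, well under a few hundred MB

# The hint tells the planner to replicate dim_users to every executor,
# so events rows are joined locally without shuffling the large side.
joined = events.join(broadcast(dim_users), "user_id")
```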
📌 Examples
Google case study on BigQuery cost control. The team set a 5 TB daily scan limit per user and required partition filters on tables over 1 TB. This caught 15 queries per week exceeding the limits, preventing an estimated $50,000 monthly overage. Alerts triggered on queries scanning over 1 TB without user acknowledgment.
Uber AresDB handling celebrity users in ride data. The team implemented a salting strategy: duplicate celebrity user rows with salt suffixes 0 to 9, join on (user_id, salt), and deduplicate the results. This distributed 100 million celebrity events across 10 workers instead of overwhelming one, and p99 latency dropped from 8 minutes to 45 seconds.