Common Failure Modes: Cost Explosions, Skew, and Concurrency Limits in Production
Serverless cost explosions happen when queries accidentally scan entire tables. A BigQuery table holding 2 years of events at 100 TB total costs $500 to scan completely at $5 per TB. Forgetting the date filter in a WHERE clause, using a wildcard on the partition column, or joining without proper filters triggers a full scan. Teams report monthly bills jumping from $5,000 to $50,000 after a single engineer deploys a dashboard with unbounded queries. The mitigation is enforcing partition filter requirements, setting per-user scan-byte limits (say, 10 TB per day), and alerting on queries that exceed thresholds.
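A minimal sketch of these guardrails using the google-cloud-bigquery Python client: a dry run estimates bytes scanned before spending anything, and maximum_bytes_billed makes the real run fail rather than bill past a cap. The table name, date range, and 10 TB cap are illustrative assumptions; BigQuery also offers a table-level require_partition_filter option to enforce filters at the source.

```python
# Sketch: pre-flight cost check plus a hard byte cap per query.
from google.cloud import bigquery

client = bigquery.Client()
MAX_BYTES = 10 * 1024**4  # illustrative 10 TiB cap per query

sql = """
    SELECT user_id, COUNT(*) AS events
    FROM `project.dataset.events`              -- hypothetical table
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-07'  -- partition filter
    GROUP BY user_id
"""

# Dry run: estimates bytes scanned without running (or billing) the query.
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
estimate = client.query(sql, job_config=dry_cfg)
print(f"Would scan {estimate.total_bytes_processed / 1e12:.2f} TB")

# Real run with a cap: the job errors out instead of billing past MAX_BYTES.
run_cfg = bigquery.QueryJobConfig(maximum_bytes_billed=MAX_BYTES)
rows = client.query(sql, job_config=run_cfg).result()
```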
Data skew creates stragglers that dominate query latency. When joining on user_id, a celebrity account with 100 million events lands on one worker while the others process 10 thousand events each. That single worker spills to disk and takes 10 minutes while the others finish in 10 seconds, so p99 latency spikes from 15 seconds to 10 minutes. Symptoms include high shuffle bytes, spill-to-disk warnings, and outlier task durations in query plans. Solutions involve salting hot keys (tag the skewed side with a random salt suffix, replicate the other side across salt values, join on the composite key, then drop the salt), filtering before joining, or switching to approximate algorithms for heavy hitters.
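A minimal PySpark sketch of the salting pattern, assuming an events table skewed on user_id and a smaller users table; all table and column names are illustrative. For simplicity the entire small side is replicated across salt values; a targeted variant salts only the known hot keys.

```python
# Sketch: spread a hot join key across SALT_BUCKETS workers.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()
SALT_BUCKETS = 10

events = spark.table("events")   # large, skewed side (hypothetical table)
users = spark.table("users")     # small side to replicate (hypothetical table)

# Large side: assign each row a random salt in [0, SALT_BUCKETS).
events_salted = events.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("long")
)

# Small side: replicate each row once per salt value so every
# (user_id, salt) combination on the large side finds a match.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
users_salted = users.crossJoin(salts)

# Join on the composite key: the celebrity user's rows now spread
# across SALT_BUCKETS workers instead of landing on one.
joined = events_salted.join(users_salted, ["user_id", "salt"]).drop("salt")
```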
Concurrency saturation manifests differently per model. MPP clusters have fixed query slots (a Redshift cluster might allocate 15 slots). When 20 BI dashboards refresh simultaneously, 5 queries queue. During peak load, average query time jumps from 5 seconds to 30 seconds (25 seconds queuing plus 5 seconds execution). Serverless systems throttle via slot quotas per project: exceeding the quota forces queries to wait for available slots, adding unpredictable latency. A BigQuery project with a 2,000-slot quota comfortably handling 50 concurrent queries degrades when traffic spikes to 100, as twice as many queries contend for the same fixed slot pool.
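A back-of-the-envelope sketch of the fixed-slot queueing arithmetic, assuming equal-length queries served in FIFO waves; the slot count and durations are the illustrative numbers above, and real workload managers schedule more cleverly than this.

```python
# Sketch: average latency when N equal-length queries share a fixed slot pool.
import math

def avg_latency(concurrent: int, slots: int, exec_seconds: float) -> float:
    """Queries run in FIFO waves of at most `slots` at a time."""
    waves = math.ceil(concurrent / slots)
    total = 0.0
    for wave in range(waves):
        in_wave = min(slots, concurrent - wave * slots)
        wait = wave * exec_seconds          # full rounds queued before starting
        total += in_wave * (wait + exec_seconds)
    return total / concurrent

print(avg_latency(20, 15, 5.0))    # 20 dashboards on 15 slots: 6.25 s average
print(avg_latency(100, 15, 5.0))   # peak load: queuing dominates (~19 s average)
```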
Small file proliferation is insidious. Streaming ingestion that writes every minute creates 1,440 files per day; with 100 partitions, that is 144 thousand files. Metadata operations slow down (listing partitions takes seconds), scans lose locality (reading 1 KB from each of 144K files is far slower than reading 144 MB from one file), and compression suffers (small files do not compress as well). Query latency degrades gradually from seconds to minutes. The fix is compaction during off-peak hours to consolidate into larger files, or batching writes at the source to produce fewer, larger files (buffer 5 minutes of events and flush 5 MB files instead of 1 KB per event).
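A minimal sketch of source-side batching, assuming a sink callable that writes one object per flush; the 5 MB and 5 minute thresholds mirror the example above and are otherwise arbitrary.

```python
# Sketch: buffer records and flush by size or age instead of per event.
import time

FLUSH_BYTES = 5 * 1024 * 1024   # flush at ~5 MB
FLUSH_SECONDS = 300             # or every 5 minutes, whichever comes first

class BufferedWriter:
    def __init__(self, sink):
        self.sink = sink                     # callable that writes one file/object
        self.buffer = bytearray()
        self.last_flush = time.monotonic()

    def write(self, record: bytes) -> None:
        self.buffer.extend(record)
        too_big = len(self.buffer) >= FLUSH_BYTES
        too_old = time.monotonic() - self.last_flush >= FLUSH_SECONDS
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.sink(bytes(self.buffer))    # one larger file instead of many tiny ones
            self.buffer.clear()
        self.last_flush = time.monotonic()
```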
💡 Key Takeaways
• Cost explosions from missing partition filters. A query on a 100 TB table without a date filter scans everything for $500, versus $25 for a 5 TB weekly window. Enforce partition filter requirements and set per-user scan-byte limits.
• Data skew creates stragglers that dominate latency. A celebrity user with 100 million events lands on one worker that takes 10 minutes while the others finish in 10 seconds, spiking p99 latency from 15 seconds to 10 minutes.
• Concurrency limits differ by model. MPP clusters queue when concurrent queries exceed fixed slots (15 slots, 20 queries means 5 wait). Serverless throttles via project slot quotas, degrading latency unpredictably under load spikes.
• Small file proliferation degrades scans. Streaming every minute creates 144 thousand files per day across 100 partitions. Metadata overhead and poor locality slow queries from seconds to minutes. Batch writes to produce 128 MB to 1 GB files.
• Shuffle and spill-to-disk signals. High shuffle bytes (tens to hundreds of GB) and spill warnings indicate a poor join strategy or skew. Broadcast small tables under a few hundred MB or salt hot keys to distribute load (see the sketch after this list).
• Monitoring bytes scanned is critical. Track bytes scanned per query and per user, set alerts on thresholds (like 10 TB per query), and review the top offenders weekly to catch runaway queries before they drain budgets.
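A minimal PySpark sketch of the broadcast hint mentioned above, assuming a large events table and a small dim_users dimension; the table names are illustrative. Broadcasting ships the small table to every worker so the large side never shuffles.

```python
# Sketch: broadcast join for a small dimension table (hypothetical names).
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

events = spark.table("events")        # large fact table
dim_users = spark.table("dim_users")  # small dimension, well under a few hundred MB

# The hint tells the planner to replicate dim_users to every executor,
# so events rows are joined locally without shuffling the large side.
joined = events.join(broadcast(dim_users), "user_id")
```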
📌 Examples
Google case study on BigQuery cost control. The team set a 5 TB daily scan limit per user and required partition filters on tables over 1 TB. This caught 15 queries per week exceeding the limits, preventing an estimated $50,000 monthly overage. Alerts triggered on queries scanning over 1 TB without user acknowledgment.
Uber AresDB handling celebrity users in ride data. The team implemented a salting strategy: duplicate celebrity user rows with salt suffixes 0 to 9, join on (user_id, salt), and deduplicate the results. This distributed 100 million celebrity events across 10 workers instead of overwhelming one, and p99 latency dropped from 8 minutes to 45 seconds.