
Cost Blow-ups and Performance Pathologies

Data warehouses fail spectacularly when architectural assumptions break down, and the economics of scan-based billing can turn a single mistake into thousands of dollars per hour. Understanding these failure modes is essential for production resilience.

Cost blow-ups happen when queries scan far more data than necessary. An unpartitioned 10 terabyte fact table costs 50 dollars to scan once at 5 dollars per terabyte. If a Business Intelligence (BI) tool auto-refreshes that query every 60 seconds for each of 50 concurrent users, that is 50 scans per minute, or 72,000 scans per day, translating to 3.6 million dollars per day. The fix is partition pruning: partition by date and ensure queries include date predicates so they scan only the relevant partitions. Similarly, the many-tiny-files anti-pattern, where billions of small files create massive metadata overhead, can throttle throughput to single-digit megabytes per second instead of gigabytes per second.

Performance pathologies often stem from data skew and wrong join strategies. If customer purchases follow a power law where 1 percent of customers generate 50 percent of orders, hash partitioning on the customer key sends half the data to one node. That straggler node takes 50 times longer than the others, and the query cannot complete until it finishes. The solution is salting: append a random suffix to hot keys to distribute load, or switch join strategies and broadcast the smaller dimension. Wrong join strategy selection (broadcasting a 100 gigabyte dimension instead of hash distributing it) explodes memory usage and spills to disk, pushing latency from 10 seconds to 10 minutes.

Data quality and consistency failures are subtle but devastating. Change Data Capture (CDC) replication lag of even 5 minutes means reports combine today's facts with yesterday's dimensions, producing incorrect metrics. Late-arriving facts that get assigned yesterday's dimension surrogate keys before a dimension update lands create orphaned records or double counting in aggregates. Non-atomic multi-table loads yield torn reads, where dashboards show partially updated gold-layer marts with mismatched totals. The fix is transactional semantics: load all related tables in a single commit, or use blue-green swaps where a new, complete version of the mart is built offline and then atomically swapped into production.
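To make the partition-pruning fix concrete, here is a minimal PySpark sketch; the paths and column names (sales_raw, event_ts, event_date, region, amount) are hypothetical, and any engine that prunes partitions on a filter predicate behaves the same way.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-pruning-sketch").getOrCreate()

# Hypothetical raw fact data with an event_ts timestamp column.
facts = spark.read.parquet("s3://warehouse/sales_raw")

# Write the fact table partitioned by date so the engine can prune partitions.
(facts
    .withColumn("event_date", F.to_date("event_ts"))
    .write
    .partitionBy("event_date")
    .mode("overwrite")
    .parquet("s3://warehouse/sales_partitioned"))

# A dashboard query with a date predicate now reads only yesterday's
# partition instead of rescanning the full table on every auto-refresh.
daily = (spark.read.parquet("s3://warehouse/sales_partitioned")
    .filter(F.col("event_date") == F.date_sub(F.current_date(), 1))
    .groupBy("region")
    .agg(F.sum("amount").alias("revenue")))
daily.show()
```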
💡 Key Takeaways
Unpartitioned large tables with Business Intelligence (BI) auto-refresh can cost 3.6 million dollars per day when 50 users scan 10 terabytes every minute at 5 dollars per terabyte scanned
Data skew where 1 percent of keys account for 50 percent of data creates straggler nodes that dominate query latency, fixable by salting hot keys or switching to broadcast joins (see the sketch after this list)
Many-tiny-files anti-pattern (billions of small files) causes severe metadata overhead, throttling throughput from hundreds of gigabytes per second to single-digit megabytes per second
Wrong join strategy (broadcasting 100 gigabyte dimension instead of hash distributing) explodes memory usage, spills to disk, and increases query latency from 10 seconds to 10 minutes
Change Data Capture (CDC) replication lag of just 5 minutes creates torn snapshots where reports combine today's facts with yesterday's dimensions, producing incorrect business metrics
Late-arriving facts assigned dimension surrogate keys before dimension updates land create orphaned records or double counts in aggregates, requiring idempotent upserts with deterministic key assignment
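The salting fix from the skew takeaway above, as a minimal PySpark sketch; the table paths, the customer_key column, and the salt factor of 16 are illustrative assumptions rather than prescribed values.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

# Hypothetical inputs: a fact table skewed on customer_key and its dimension.
orders = spark.read.parquet("s3://warehouse/orders")
customers = spark.read.parquet("s3://warehouse/customers")

SALT_BUCKETS = 16  # illustrative; size this to the observed skew

# Fact side: append a random salt so rows for a hot customer_key spread
# across SALT_BUCKETS different shuffle partitions.
orders_salted = orders.withColumn(
    "salt", (F.rand(seed=42) * SALT_BUCKETS).cast("int"))

# Dimension side: replicate each customer row once per salt value so every
# salted bucket still finds its matching dimension row.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
customers_salted = customers.crossJoin(salts)

# Join on (customer_key, salt); no single node receives the entire hot key.
joined = orders_salted.join(customers_salted, ["customer_key", "salt"])

# If the dimension genuinely fits in executor memory, broadcasting it is the
# simpler fix and avoids the shuffle entirely:
# joined = orders.join(F.broadcast(customers), "customer_key")
```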
📌 Examples
E-commerce company with an exploratory analyst joining a 10 TB fact to a 5 TB dimension without predicates, triggering a 75 dollar scan that the BI tool repeats for each of 50 dashboard viewers, costing 3,750 dollars in one hour
Streaming pipeline writing 100,000 files per hour each 10 MB instead of batching into 1 GB files, reducing BigQuery scan throughput from 200 GB per second to 5 GB per second due to metadata overhead
Retail mart loaded in sequence: fact table at 2:00 AM with today's dimension keys, dimension SCD update at 2:05 AM, causing a 5 minute window where the dashboard shows facts attributed to wrong customer segments (the blue-green swap sketch after this list avoids this torn-read window)
Social network with influencer accounts generating 1,000x more events than typical users, hash partitioning on user_id sends 80 percent of data to 2 out of 100 nodes, query takes 40x longer than balanced load
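To make the blue-green swap concrete for the retail-mart scenario above, here is a minimal PySpark / Spark SQL sketch; the gold and silver database, table, and view names are invented for illustration, and the same pattern works in any warehouse that supports atomic view or table replacement.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("blue-green-swap-sketch").getOrCreate()

# Build the complete new version of the mart offline under a versioned name.
new_version = "gold.sales_mart_v2"  # hypothetical versioned table name
(spark.table("silver.orders")
    .join(spark.table("silver.customers"), "customer_key")
    .write.mode("overwrite")
    .saveAsTable(new_version))

# Dashboards query the view, never the physical tables. Repointing the view
# is a single catalog operation, so readers see either the old mart or the
# new one in full, never a half-loaded mix of the two.
spark.sql(f"CREATE OR REPLACE VIEW gold.sales_mart AS SELECT * FROM {new_version}")

# Once no running queries depend on it, the previous physical version
# (e.g. gold.sales_mart_v1) can be dropped or kept around for rollback.
```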