Data Warehousing Fundamentals › Modern Data Warehouse Architecture · Hard · ⏱️ ~3 min

Failure Modes: What Breaks at 10x Scale

Data Quality Failures: The most common production incident is not infrastructure failure but bad data silently propagating through pipelines. A microservice deploys a schema change, renaming order_total to total_amount. Your ingestion pipeline keeps reading the old column, which is now always null. Revenue dashboards show zero dollars for three days before anyone notices. At advertising companies, a 1% undercount in ad spend metrics can mean millions in lost revenue or compliance issues. You need schema validation at ingestion, data quality checks at every transformation stage (validate row counts, null ratios, distribution shifts), and alerting when metrics fall outside expected ranges.
❗ Remember: Implement circuit breakers in pipelines. If an hourly job detects row counts dropped by more than 20% compared to the same hour last week, pause the pipeline and alert before propagating bad data to gold tables.
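A minimal sketch of that circuit breaker, comparing the current hour's row count against the same hour last week (the function name and 20% threshold here mirror the text; how you fetch the counts and pause the pipeline depends on your orchestrator):

```python
def should_pause_pipeline(current_count, baseline_count, max_drop_ratio=0.20):
    """Return True if the hourly row count dropped more than
    max_drop_ratio versus the same hour last week."""
    if baseline_count == 0:
        # No baseline to compare against: let the run proceed.
        return False
    drop = (baseline_count - current_count) / baseline_count
    return drop > max_drop_ratio

# Same hour last week saw 100,000 rows; this hour only 70,000 arrived (30% drop).
if should_pause_pipeline(70_000, 100_000):
    # In a real pipeline: pause the DAG and alert the on-call engineer
    # before bad data propagates to gold tables.
    print("Pipeline paused: row count anomaly")
```

The key design choice is comparing against the same hour last week rather than the previous hour, so normal daily and weekly seasonality does not trip the breaker.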
Late and Out-of-Order Data: Event streams do not arrive in perfect order. A mobile app loses network, queues events locally, then uploads them 2 hours later. If you compute hourly revenue aggregates with strict time windows, you undercount. An event with event_time = 14:30 arrives at processing_time = 16:45, missing the 14:00 to 15:00 window. Typical production systems allow a grace period: windows stay open for late data for 1 to 6 hours, depending on business requirements. After the grace period closes, a backfill mechanism recomputes affected aggregates. For financial metrics, you might keep windows open for 24 hours and run a final reconciliation job nightly to catch all late arrivals.
Late Data Impact
  Without grace period:  ~5% undercount
  With 6-hour window:    <0.1% undercount
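The grace-period logic above can be sketched in pure Python: events are bucketed by event_time, not processing_time, and anything arriving past the grace period is routed to the backfill path. (Names and the 6-hour constant are illustrative; stream processors like Flink or Spark express this as allowed lateness on a window.)

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(hours=6)

def assign_window(event_time: datetime) -> datetime:
    """Bucket an event into its hourly window by event_time."""
    return event_time.replace(minute=0, second=0, microsecond=0)

def handle_event(event_time, processing_time, open_windows, late_events):
    window = assign_window(event_time)
    if processing_time - event_time <= GRACE_PERIOD:
        # Window still open: count the event in its original hour.
        open_windows[window] = open_windows.get(window, 0) + 1
    else:
        # Past the grace period: route to the nightly backfill job.
        late_events.append((window, event_time))

windows, late = {}, []
# Event from 14:30 arrives at 16:45 -- 2h15m late, inside the 6h grace period.
handle_event(datetime(2024, 1, 1, 14, 30), datetime(2024, 1, 1, 16, 45), windows, late)
# Same event arriving the next morning -- past the grace period, goes to backfill.
handle_event(datetime(2024, 1, 1, 14, 30), datetime(2024, 1, 2, 9, 0), windows, late)
```

The first event is correctly attributed to the 14:00 window despite arriving at 16:45; the second is deferred to the backfill that recomputes affected aggregates.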
Skew and Hotspots: In a distributed system, work is partitioned across compute nodes. If partitioning is naive (for example, hashing on customer_id), one large customer with 10x more data causes one task to run 10x longer, and the entire job waits for the slowest task. This is especially painful for joins. Joining users to events where one bot user has 1 million events while normal users have 10 creates massive skew: one partition processes 1 million rows while the others process 10, and p99 latency explodes even though p50 is fine. Mitigations include salting: append a random suffix to skewed keys to spread them across multiple partitions. For the bot user, you might split their events into 100 partitions of 10,000 rows each, bringing them in line with normal users. You can also use adaptive query execution, which detects skew at runtime and repartitions dynamically.
Runaway Queries and Cost Explosions: Someone runs SELECT * FROM events e JOIN users u ON e.user_id = u.id without a WHERE clause, joining 10 TB of events to 100 GB of users. The query scans 10 TB, spins up 200 compute nodes, runs for 15 minutes, and costs $150. (In BigQuery's on-demand model, the scan alone would be $50 at $5 per TB.) Do this 10 times a day and you have a $1,500 daily surprise. Production warehouses enforce query limits: maximum runtime (10 to 30 seconds for ad hoc queries, 5 minutes for scheduled ones), maximum scanned bytes (1 TB per query for exploratory work), and per-user cost budgets. When a query exceeds its limits, it gets killed automatically. You also implement query reviews: queries scanning over 1 TB require manager approval and optimization before running.
Eventual Consistency and Stale Views: Your architecture has latency at every stage. CDC has a 1 minute lag, transformation runs every 10 minutes, and materialized views refresh every 15 minutes. A user places an order at 10:00, but your revenue dashboard might not reflect it until 10:16.
If you also have a real-time event stream showing order count with 30-second lag, the dashboard briefly shows inconsistent numbers: order count increased but revenue did not. This confuses executives who expect numbers to match. You need clear data freshness indicators on dashboards: "Revenue data as of 10:15, Order count as of 10:00." For critical metrics, you might invest in a faster path: stream events directly to an OLAP engine like Druid or ClickHouse with sub-minute latency, accepting higher cost for real-time visibility.
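The salting mitigation described under Skew and Hotspots can be sketched as follows. The skewed-key set and salt count are illustrative; in practice you would identify hot keys from key-frequency statistics, and the small side of the join must be replicated under every salt suffix so salted keys still match.

```python
import random
from collections import Counter

SKEWED_KEYS = {"bot_user_42"}  # hot keys identified from key-frequency stats
NUM_SALTS = 100

def salted_key(user_id: str) -> str:
    """Spread a hot key across NUM_SALTS partitions; leave normal keys alone."""
    if user_id in SKEWED_KEYS:
        return f"{user_id}#{random.randrange(NUM_SALTS)}"
    return user_id

# The bot's 1,000,000 events now hash to 100 distinct keys of ~10,000 rows
# each, instead of one partition holding all 1,000,000 rows.
counts = Counter(salted_key("bot_user_42") for _ in range(1_000_000))
```

After salting, the largest partition carries roughly 10,000 rows instead of 1 million, which is what brings p99 back in line with p50. Adaptive query execution (for example, in Spark 3+) achieves a similar effect automatically by splitting oversized partitions at runtime.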
💡 Key Takeaways
Schema changes at source systems are the most common cause of data quality incidents. Implement validation at ingestion and circuit breakers that pause pipelines when row counts drop by more than 20%.
Late data causes undercounts. Production systems keep aggregation windows open for 1 to 6 hours and run nightly backfills to recompute affected metrics, reducing undercount from 5% to under 0.1%.
Skew from large customers or bot users causes one partition to process 10x more data, making p99 latency 10x worse than p50. Salting keys or adaptive repartitioning splits hot keys across multiple partitions.
Runaway queries without limits can scan 10 TB and cost $150 each. Enforce maximum runtime (30 seconds ad hoc), maximum scanned bytes (1 TB), and require approval for queries over thresholds.
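The query-limit enforcement in the last takeaway is typically a pre-flight check against the warehouse's scan estimate (BigQuery exposes this via a dry run, for example). A minimal sketch, with the 1 TB limit and $5-per-TB rate taken from this section; the function name is illustrative:

```python
TB = 1024 ** 4
MAX_SCAN_BYTES = 1 * TB   # exploratory-query limit from this section
PRICE_PER_TB = 5.0        # BigQuery-style on-demand rate, USD

def check_query_budget(estimated_bytes: int) -> None:
    """Reject a query before execution if it exceeds the scan limit."""
    if estimated_bytes > MAX_SCAN_BYTES:
        cost = estimated_bytes / TB * PRICE_PER_TB
        raise RuntimeError(
            f"Query would scan {estimated_bytes / TB:.1f} TB (~${cost:.0f} "
            f"in scan fees); queries over 1 TB need approval")

check_query_budget(200 * 1024 ** 3)   # 200 GB scan: allowed
# check_query_budget(10 * TB)         # the 10 TB runaway join: rejected
```

Running the check before execution, rather than killing the query mid-flight, means the runaway join never spins up those 200 compute nodes in the first place.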
📌 Interview Tips
1. An e-commerce company's revenue dropped to zero in dashboards after a microservice renamed order_total to total_amount. Adding schema validation at ingestion caught this in staging before production.
2. A social network with 10 million events per hour saw p99 query latency jump from 3 seconds to 45 seconds due to one bot account with 1 million events. Salting the bot's user_id across 100 partitions brought p99 back to 4 seconds.
3. Uber's data platform keeps aggregation windows open for 6 hours to capture 99.9% of late arrivals, then runs a nightly reconciliation at 2am to backfill the remaining 0.1% and ensure financial metrics are accurate.