Failure Modes and Production Edge Cases
The Hidden Dangers: Even well-designed OLTP and OLAP systems fail in predictable ways at scale. Knowing these edge cases is critical both for interviews and for operating production systems.
OLTP Failure Mode: Analytics Contamination
The most common OLTP failure is allowing analytical workloads to run on the primary transactional database. A single long-running report query that scans millions of rows can lock tables or saturate I/O bandwidth, causing p99 latency spikes from 100ms to 5+ seconds for user traffic. This directly violates Service Level Agreements (SLAs) and impacts revenue.
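A minimal sketch of workload isolation, under assumptions not stated in the text: the pool names, the analytical-hint list, and the 2-second timeout are illustrative. It routes anything that looks analytical away from the primary and caps runtime on the primary using PostgreSQL's statement_timeout setting so a stray report query is killed instead of holding locks.

```python
# Hypothetical query router: keep analytical scans off the OLTP primary and
# enforce a short statement timeout for anything that does run there.

ANALYTICAL_HINTS = ("GROUP BY", "SUM(", "COUNT(", "/* report */")

def choose_pool(sql: str) -> str:
    """Send anything that looks analytical to a replica/warehouse pool."""
    upper = sql.upper()
    if any(hint in upper for hint in ANALYTICAL_HINTS):
        return "analytics_replica"   # long scans allowed here
    return "oltp_primary"            # short transactional queries only

def guard_primary(sql: str, timeout_ms: int = 2000) -> str:
    """Cap runtime on the primary (PostgreSQL-style statement_timeout),
    so a runaway query is cancelled rather than spiking p99 latency."""
    return f"SET statement_timeout = {timeout_ms}; {sql}"

if __name__ == "__main__":
    report = "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id"
    lookup = "SELECT * FROM orders WHERE id = 42"
    print(choose_pool(report))    # analytics_replica
    print(choose_pool(lookup))    # oltp_primary
    print(guard_primary(lookup))
```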
High contention on hot rows is another classic problem. If thousands of concurrent transactions update a single inventory counter or a global configuration table, you hit lock timeouts and deadlocks when Queries Per Second (QPS) grow 10x during traffic spikes. The solution is sharding hot keys or moving counters to specialized systems like Redis, but this requires careful design.
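A minimal sketch of the sharded-counter idea, assuming a shard count of 16 (not specified in the text): writes are spread across N shard slots instead of one hot row, and reads sum them.

```python
# Sharded counter sketch: contention on any single row drops roughly by
# a factor of NUM_SHARDS; reads pay the cost of aggregating all shards.
import random

NUM_SHARDS = 16
shards = [0] * NUM_SHARDS  # stand-in for rows like inventory_counter_0..15

def increment(amount: int = 1) -> None:
    """Each writer picks a random shard, avoiding a single hot lock."""
    shards[random.randrange(NUM_SHARDS)] += amount

def read_total() -> int:
    """Reads must sum every shard to get the true counter value."""
    return sum(shards)

if __name__ == "__main__":
    for _ in range(10_000):
        increment()
    print(read_total())  # 10000
```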
Replication Lag Edge Case:
Many companies add read replicas to offload read queries from the OLTP primary. If replicas are used for near-real-time analytics, replication lag becomes critical. At peak traffic, lag can grow from 50 milliseconds to multiple seconds or even minutes, leading to confusing inconsistencies where a user sees stale data on a dashboard seconds after making a purchase.
The mitigation is to monitor lag metrics and route critical reads back to the primary when lag exceeds a threshold, say 500 milliseconds. Accept the higher primary load rather than serve stale data.
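A minimal sketch of lag-aware read routing, assuming a get_replica_lag_ms() metric (e.g. scraped from the replica or a monitoring system, hard-coded here so the sketch runs) and the 500 ms threshold from the text.

```python
LAG_THRESHOLD_MS = 500

def get_replica_lag_ms() -> float:
    """Placeholder: in production this would come from replication
    metrics; a fixed value keeps the sketch self-contained."""
    return 120.0

def route_read(critical: bool) -> str:
    """Critical reads (e.g. 'show my just-placed order') fall back to the
    primary when the replica is too far behind; everything else stays on
    the replica to protect primary capacity."""
    if critical and get_replica_lag_ms() > LAG_THRESHOLD_MS:
        return "primary"
    return "replica"

if __name__ == "__main__":
    print(route_read(critical=True))   # replica (lag 120 ms < 500 ms)
    print(route_read(critical=False))  # replica
```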
OLAP Failure Mode: Data Correctness Issues
Batch ETL jobs can fail silently or partially, leading to double counting, missing partitions, or mismatched dimensions. If a daily job that loads 1 terabyte (TB) of clickstream data fails halfway through and is re-run incorrectly without deduplication, dashboards show inflated metrics for that day. Detection requires row count validation, checksum comparison, and referential integrity checks between fact and dimension tables.
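A minimal sketch of post-load validation for a daily partition, covering the row-count and checksum checks mentioned above; the row shape and the order-insensitive XOR-of-hashes checksum are illustrative assumptions.

```python
# Compare the source extract against the loaded OLAP partition before
# publishing it: a re-run without deduplication fails both checks.
import hashlib

def row_checksum(rows):
    """Order-insensitive checksum: hash each row, XOR the digests together."""
    acc = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return acc

def validate_partition(source_rows, loaded_rows):
    errors = []
    if len(source_rows) != len(loaded_rows):
        errors.append(f"row count mismatch: {len(source_rows)} vs {len(loaded_rows)}")
    if row_checksum(source_rows) != row_checksum(loaded_rows):
        errors.append("checksum mismatch: contents differ (possible double load)")
    return errors

if __name__ == "__main__":
    src = [("2024-05-01", "click", 1), ("2024-05-01", "click", 2)]
    dup = src + src  # simulates a re-run without deduplication
    print(validate_partition(src, src))  # []
    print(validate_partition(src, dup))  # both checks fail
```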
Schema drift is another classic issue. Application teams change OLTP schemas or event payloads to add new fields or modify types, but ETL logic and OLAP schemas don't evolve in sync. Suddenly the pipeline breaks with type mismatch errors or silently drops columns, breaking downstream analytics. The solution is schema versioning and backward-compatible transformations.
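A minimal sketch of a backward-compatible event transform, with entirely illustrative field names: a schema_version field is carried through, type drift is coerced instead of crashing the pipeline, missing new fields get defaults, and unknown fields are ignored rather than breaking the load.

```python
# Hypothetical event normalizer tolerant of schema drift between producers
# on different schema versions.

EXPECTED = {"user_id": str, "amount": float}
DEFAULTS = {"currency": "USD"}  # field added in schema v2

def normalize_event(raw: dict) -> dict:
    version = raw.get("schema_version", 1)
    event = {}
    for field, ftype in EXPECTED.items():
        if field not in raw:
            raise ValueError(f"required field missing: {field}")
        event[field] = ftype(raw[field])  # coerce instead of failing on type drift
    for field, default in DEFAULTS.items():
        event[field] = raw.get(field, default)  # tolerate pre-v2 events
    event["schema_version"] = version
    return event

if __name__ == "__main__":
    v1 = {"user_id": 42, "amount": "19.99"}                 # old producer
    v2 = {"schema_version": 2, "user_id": "42", "amount": 19.99,
          "currency": "EUR", "new_debug_field": "ignored"}  # new producer
    print(normalize_event(v1))
    print(normalize_event(v2))
```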
Resource Contention in OLAP:
A single poorly written query that scans an unpartitioned petabyte table without time filters can starve other queries, pushing their latencies from seconds to tens of minutes. Large organizations deploy workload management with query limits (max scan size, max runtime) and resource queues to isolate heavy queries from interactive dashboards.
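A minimal sketch of that kind of admission control, with invented queue names and limits: queries whose estimated scan size or runtime exceeds their queue's limits are rejected, so a rogue full-table scan cannot starve interactive dashboards.

```python
# Hypothetical workload-management check: per-queue max scan size and runtime.
from dataclasses import dataclass

@dataclass
class QueueLimits:
    max_scan_bytes: int
    max_runtime_s: int

QUEUES = {
    "interactive": QueueLimits(max_scan_bytes=100 * 10**9, max_runtime_s=60),
    "batch":       QueueLimits(max_scan_bytes=50 * 10**12, max_runtime_s=3600),
}

def admit(queue: str, est_scan_bytes: int, est_runtime_s: int) -> str:
    limits = QUEUES[queue]
    if est_scan_bytes > limits.max_scan_bytes:
        return "rejected: add a partition/time filter or use the batch queue"
    if est_runtime_s > limits.max_runtime_s:
        return "rejected: exceeds max runtime for this queue"
    return "admitted"

if __name__ == "__main__":
    # Unpartitioned ~1 PB scan submitted to the interactive queue:
    print(admit("interactive", est_scan_bytes=10**15, est_runtime_s=1800))
    # Same query with a time filter (~10 TB) routed to the batch queue:
    print(admit("batch", est_scan_bytes=10**13, est_runtime_s=1800))
```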
The Hybrid Edge Case: Near-Real-Time Features
Fraud detection, ad bidding, and recommendation systems often need features computed on data that's no more than 5 minutes old. Full batch OLAP with hourly or daily refreshes is too slow, but pure OLTP can't aggregate years of history in real time without crushing performance.
The solution is streaming analytics: process event streams with micro-batches (seconds to minutes) to compute rolling aggregates and push them to fast key-value stores. You accept eventual consistency and out-of-order event handling as trade-offs for freshness.
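A minimal sketch of micro-batch aggregation: each batch updates a rolling per-user spend over the 5-minute window from the text and pushes it to an in-memory dict standing in for a key-value store such as Redis. The event shape is an assumption, and events are sorted within a batch; cross-batch lateness is the watermark problem covered in the note below.

```python
# Micro-batch rolling aggregate pushed to a key-value store stand-in.
from collections import defaultdict, deque
import time

WINDOW_S = 300                 # 5-minute rolling window
feature_store = {}             # stand-in for a fast key-value store
buckets = defaultdict(deque)   # per-key deque of (event_time, amount)

def process_micro_batch(events):
    """events: iterable of (user_id, amount, event_time) tuples."""
    now = time.time()
    for user_id, amount, event_time in sorted(events, key=lambda e: e[2]):
        buckets[user_id].append((event_time, amount))
    for user_id, dq in buckets.items():
        # Evict events older than the window, then recompute the aggregate.
        while dq and dq[0][0] < now - WINDOW_S:
            dq.popleft()
        feature_store[f"spend_5m:{user_id}"] = sum(a for _, a in dq)

if __name__ == "__main__":
    now = time.time()
    process_micro_batch([("u1", 20.0, now - 10), ("u1", 5.0, now - 400)])
    print(feature_store["spend_5m:u1"])  # 20.0 — the 400s-old event is evicted
```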
Replication Lag Impact: NORMAL 50 ms → PEAK TRAFFIC 5 sec → CRISIS 3 min
✓ In Practice: Netflix handles out of order events in streaming pipelines by using watermarks and late arrival windows. Events arriving more than 10 minutes late are dropped to prevent indefinite state growth. This trades completeness for bounded memory.
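A minimal sketch of watermark-based late-event handling in the spirit of that note: the watermark trails the maximum event time seen, and events arriving more than the allowed lateness behind it are dropped so state stays bounded. The 10-minute bound comes from the text; everything else is an assumption.

```python
# Watermark filter: drop events that arrive too far behind the watermark.
ALLOWED_LATENESS_S = 600

class WatermarkFilter:
    def __init__(self):
        self.max_event_time = 0.0

    def accept(self, event_time: float) -> bool:
        """Return True if the event should be processed, False if it is
        too late relative to the current watermark."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - ALLOWED_LATENESS_S
        return event_time >= watermark

if __name__ == "__main__":
    wm = WatermarkFilter()
    print(wm.accept(1_000.0))  # True  — advances the watermark
    print(wm.accept(500.0))    # True  — 500 s late, within the 600 s window
    print(wm.accept(300.0))    # False — 700 s late, dropped
```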
💡 Key Takeaways
✓ Analytics queries on the OLTP primary can spike p99 latency from 100ms to 5+ seconds, violating SLAs and requiring immediate query kill or workload isolation
✓ Replication lag grows from 50ms to minutes under peak traffic; mitigate by routing critical reads to the primary when lag exceeds a 500ms threshold
✓ ETL failures cause silent data corruption through double counting or missing partitions; detect with row count validation and checksum comparison across pipeline stages
✓ Streaming analytics for near-real-time needs (under 5 minutes of lag) requires handling out-of-order events with watermarks and accepting eventual consistency
📌 Examples
1. OLTP failure: A report query locks the order table for 30 seconds, causing 5,000 user requests to time out and retry, cascading into database connection pool exhaustion
2. OLAP failure: A daily ETL job re-runs without deduplication after a partial failure; the dashboard shows 2x actual revenue for that day until the data team manually corrects the partition