
Failure Modes: Replication Lag, CDC Gaps, Schema Drift, and Runaway Costs

Replication lag and read-your-writes violations are common OLTP pitfalls when read replicas are used for scaling or mixed workloads. Replicas can fall behind under load, especially during heavy write bursts or large transactions. Clients that route reads to replicas may observe stale state: a user updates their profile, the write commits on the primary, but the next read hits a lagging replica and returns the old value. Monitor replication lag via logical log sequence positions or replication-delay metrics, set alerting thresholds (for example, alert when lag exceeds 5 seconds), and route critical reads that require read-your-writes consistency to the primary, or use session stickiness to pin a user's session to a single replica.

Change Data Capture (CDC) pipelines introduce their own failure modes. Log-based CDC can drop or duplicate events during connector failovers, network partitions, or when reading from a failed-over primary with a different log position. Without idempotent upserts (using deterministic primary keys and versioning) and exactly-once semantics at the sinks, OLAP facts can be double counted: the same order is inserted twice, inflating revenue by millions. Table reorganizations such as type changes, tablespace moves, or VACUUM FULL can reset log positions, triggering full-table backfills that flood downstream systems with terabytes of data and overwhelm ingestion capacity; freshness degrades from seconds to hours and operational SLIs are breached.

Schema drift and slowly changing dimension (SCD) handling break pipelines in subtle ways. Evolving schemas, such as column type changes (integer to bigint), column renames, or dimension rekeys (changing how product_id is assigned), break Extract-Transform-Load (ETL) or Extract-Load-Transform (ELT) jobs if not handled carefully. Mishandling SCDs leads to time-travel inaccuracies: if you update a customer's region in place (Type 1 SCD) instead of versioning with effective dates (Type 2), historical revenue reports retroactively attribute all past orders to the new region, violating financial reconciliation.

Data skew and join blow-ups cause resource exhaustion. A single high-traffic customer or product can create a hot key that skews partitions, so one worker processes gigabytes while the others sit idle. Cross joins or insufficient filter predicates can explode the join graph, generating trillions of intermediate rows that spill to disk and run for hours before being terminated out of memory.

Cost runaway is a production reality. Ad hoc queries that scan full history (querying every order ever recorded, with no date filter, against a petabyte-scale table) can spike costs by orders of magnitude in pay-per-scan cloud warehouses, producing five- or six-figure bills for a single query. Materializing too many aggregate combinations causes cube explosion: precomputing revenue for every combination of 10 dimensions with 10 values each creates 10 billion aggregate rows. Engineers must enforce guardrails: reject full-history scans in interactive contexts, implement query cost estimation and approval workflows for expensive queries, and set per-user or per-team spending limits with automatic query cancellation when they are exceeded.
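As a rough illustration of the read-routing mitigation, the Python sketch below pins a session to a healthy replica and falls back to the primary for reads that need read-your-writes. The session and replica objects, their attribute names, and the 10-second write window are assumptions for the example, not a prescribed API.

```python
import time

LAG_ALERT_SECONDS = 5.0          # example alert threshold from the text
READ_YOUR_WRITES_WINDOW = 10.0   # assumed window after a session's last write

def replica_lag_seconds(replica) -> float:
    """Placeholder: in practice, read the replica's replication-delay metric,
    e.g. the gap between primary and replica log sequence positions."""
    return replica.reported_lag_seconds

def choose_read_target(session, primary, replicas):
    """Route a read to a replica only when it is safe to do so.

    - Reads issued shortly after the session's own write go to the primary,
      preserving read-your-writes.
    - Replicas whose lag exceeds the alert threshold are skipped.
    - Otherwise the session stays pinned to one replica (session stickiness).
    """
    recently_wrote = time.time() - session.last_write_at < READ_YOUR_WRITES_WINDOW
    if session.requires_read_your_writes or recently_wrote:
        return primary

    pinned = session.pinned_replica
    if pinned is not None and replica_lag_seconds(pinned) < LAG_ALERT_SECONDS:
        return pinned

    healthy = [r for r in replicas if replica_lag_seconds(r) < LAG_ALERT_SECONDS]
    if not healthy:
        return primary  # all replicas lagging: fall back to the primary
    session.pinned_replica = healthy[0]
    return session.pinned_replica
```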
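The idempotent-upsert idea can be shown with a minimal in-memory merge: every event carries a deterministic key and a version, so duplicates and replays never win the version check. The dictionary-based fact table and field names below are illustrative stand-ins for a warehouse-side MERGE, assuming order_id as the key and a monotonically increasing version.

```python
def apply_cdc_events(fact_table: dict, events: list) -> dict:
    """Idempotently merge CDC events into an order fact table.

    fact_table maps order_id -> latest row. A re-delivered or duplicated
    event never wins the version comparison, so replaying a batch after a
    connector failover cannot double-count revenue.
    """
    for event in events:
        key = event["order_id"]
        current = fact_table.get(key)
        if current is None or event["version"] > current["version"]:
            fact_table[key] = event
    return fact_table

# Duplicate delivery of the same event leaves the table unchanged.
facts = {}
batch = [
    {"order_id": "o-1", "version": 1, "amount": 100},
    {"order_id": "o-1", "version": 2, "amount": 120},
    {"order_id": "o-1", "version": 2, "amount": 120},  # replayed duplicate
]
apply_cdc_events(facts, batch)
assert len(facts) == 1 and facts["o-1"]["amount"] == 120
```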
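A minimal sketch of Type 2 versioning, assuming a dimension row with hypothetical valid_from and valid_to columns: instead of overwriting the region in place (Type 1), the current row is closed out and a new row appended, so historical facts joined on effective-date ranges keep the region that was true at order time.

```python
from datetime import date

def scd2_update(dimension_rows: list, customer_id: str, new_region: str,
                effective: date) -> list:
    """Apply a Type 2 slowly-changing-dimension update."""
    for row in dimension_rows:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            row["valid_to"] = effective          # close the current version
    dimension_rows.append({
        "customer_id": customer_id,
        "region": new_region,
        "valid_from": effective,
        "valid_to": None,                        # open-ended current version
    })
    return dimension_rows

dim = [{"customer_id": "c-9", "region": "EMEA",
        "valid_from": date(2022, 1, 1), "valid_to": None}]
scd2_update(dim, "c-9", "APAC", date(2024, 6, 1))
# Orders placed before 2024-06-01 still resolve to EMEA; later orders to APAC.
```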
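Cost guardrails might look roughly like the following pre-execution check. The byte estimate, the date-filter detection, and the 5 TB threshold are assumed inputs and example values, not a specific warehouse's API; spending limits with automatic cancellation would live in the warehouse's workload management, outside this sketch.

```python
APPROVAL_THRESHOLD_BYTES = 5 * 10**12   # example: require approval above ~5 TB scanned

class QueryRejected(Exception):
    """Raised when a query fails the guardrail check before execution."""

def enforce_scan_guardrail(sql: str, estimated_bytes: int, interactive: bool,
                           has_date_filter: bool, approved: bool = False) -> None:
    """Reject or gate expensive queries before they run.

    estimated_bytes is assumed to come from the warehouse's dry-run or
    cost-estimation facility; has_date_filter from a simple parse of the
    WHERE clause. Both are computed elsewhere.
    """
    if interactive and not has_date_filter:
        raise QueryRejected(
            "full-history scans are not allowed in interactive sessions; add a date filter")
    if estimated_bytes > APPROVAL_THRESHOLD_BYTES and not approved:
        raise QueryRejected(
            f"estimated scan of {estimated_bytes / 10**12:.1f} TB exceeds the threshold; "
            "submit through the approval workflow")
```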
💡 Key Takeaways
Replication lag causes read-your-writes violations when replicas fall behind under load; monitor lag via log positions, alert at thresholds (for example, 5 seconds), and route critical reads to the primary or use session stickiness
CDC pipelines can drop or duplicate events during failover; without idempotent upserts on deterministic keys and exactly-once semantics at sinks, facts get double counted (the same order inserted twice, inflating revenue)
Table reorganizations (vacuum, type changes) can reset CDC log positions, triggering full backfills that flood ingestion with terabytes and degrade freshness from seconds to hours
Incorrect slowly changing dimension handling (Type 1 in-place updates instead of Type 2 versioning) causes time-travel inaccuracies where historical reports retroactively attribute data to new dimension values, breaking financial reconciliation
Cost runaway from ad hoc full-history scans can spike bills by orders of magnitude (example: a petabyte scan generating a six-figure cost); enforce guardrails with query cost estimation, approval workflows, and per-team spending limits with automatic cancellation
📌 Examples
Uber trip database: replica lag spiked to 45 seconds during peak, causing riders to see stale trip status; the mitigation was to route status reads to the primary and set the lag alert at 10 seconds
Amazon analytics pipeline: a CDC connector failover during a schema migration caused 3 hours of duplicate order events; required an idempotent merge on an (order_id, version) composite key to deduplicate in the warehouse
Meta data warehouse: an engineer ran an ad hoc query without a date filter, scanned 80 TB of historical events, and generated a $12,000 bill; the team implemented a query cost estimator and an approval gate for queries scanning over 5 TB