Data Storage Formats & Optimization • Row-based vs Columnar Formats
Failure Modes and Edge Cases
Mismatched Format and Workload:
The most common failure is using the wrong format for your access pattern. Running heavy analytics directly on a row-based OLTP database causes catastrophic problems at scale. Consider a 5 TB transactional database serving hot user-facing traffic, where a data analyst runs a daily aggregation that scans entire tables.
This causes lock contention on hot rows, page cache eviction for frequently accessed data, and severe replication lag. p99 latency for user-facing traffic jumps from 20 milliseconds to 500 milliseconds, even though the analytical query technically "works." Production incidents at many companies trace back to this exact scenario: an expensive analytical query bringing down the operational database.
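To make the mismatch concrete, here is a minimal sketch, assuming an in-memory SQLite database as a stand-in for the OLTP store; the orders table, its columns, and the row counts are illustrative, not taken from the incident above.

```python
import sqlite3
import time

# Stand-in OLTP table: the primary-key index makes point lookups cheap.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders (user_id, amount) VALUES (?, ?)",
    ((i % 10_000, i * 0.01) for i in range(1_000_000)),
)
conn.commit()

# OLTP-style access: a point lookup touches a handful of index and data pages.
start = time.perf_counter()
conn.execute("SELECT * FROM orders WHERE id = ?", (123_456,)).fetchone()
print(f"point lookup:    {time.perf_counter() - start:.4f} s")

# Analytics-style access: a full-table aggregation scans every row, holding
# resources long enough to evict hot pages and starve user-facing queries.
start = time.perf_counter()
conn.execute("SELECT user_id, SUM(amount) FROM orders GROUP BY user_id").fetchall()
print(f"full-table scan: {time.perf_counter() - start:.4f} s")
```

On a real production cluster the gap is far larger, because the scan also competes for locks, buffer pool space, and replication bandwidth.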
Columnar for Point Lookups:
The symmetric failure happens when teams use a pure columnar warehouse as the backing store for an API serving per-user dashboards. A point lookup touching 200 columns for a single user requires reading from many column segments, sometimes across multiple files and nodes.
Latency jumps from tens of milliseconds in a row store to hundreds of milliseconds or even seconds, especially under concurrency, because columnar systems are tuned for throughput, not tail latency. A production example: a team built a customer dashboard that queried Redshift directly. Under 1,000 concurrent users, p99 latency hit 8 seconds, causing timeouts and a poor user experience. Moving the serving path to a row-based cache fixed it.
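A rough sketch of why single-row reads are expensive on columnar files, assuming pyarrow is installed; the 200-column schema, file name, and row count are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative wide table: 200 columns, 10,000 rows, one column chunk per column.
num_rows, num_cols = 10_000, 200
data = {f"attr_{i}": list(range(num_rows)) for i in range(num_cols)}
pq.write_table(pa.Table.from_pydict(data), "users.parquet")

# Columnar point lookup: reconstructing one row means decoding a value from
# every one of the 200 column chunks, often spread across files and nodes.
table = pq.read_table("users.parquet")
row = table.slice(1_234, 1).to_pylist()[0]

# Row-oriented equivalent: one key maps to one contiguous record, a single read.
row_store = {1_234: {f"attr_{i}": 1_234 for i in range(num_cols)}}
row_fast = row_store[1_234]
```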
Update and Delete Problems:
Columnar formats usually treat data as append-only and implement updates as a delete plus a reinsert in a new batch. At high update rates, for example millions of updates per hour on a 10 TB table, this produces many small fragments and frequent compaction jobs.
If compaction falls behind, queries slow down because they must read more files and reconcile more versions. This is a common failure mode in large data lakes: one company saw query times grow from 30 seconds to 10 minutes over a week as fragments accumulated, until an emergency compaction window brought performance back.
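A minimal sketch of that compaction step, assuming pyarrow and a directory of small Parquet fragments with identical schemas; the layout and function name are illustrative rather than any specific table format's API.

```python
import glob
import pyarrow as pa
import pyarrow.parquet as pq

def compact(fragment_dir: str, output_path: str) -> list[str]:
    """Merge small append-only fragments into one large file so readers
    open far fewer files and reconcile fewer versions."""
    fragment_paths = sorted(glob.glob(f"{fragment_dir}/*.parquet"))
    tables = [pq.read_table(path) for path in fragment_paths]
    # All fragments must share a schema; concat_tables raises otherwise.
    merged = pa.concat_tables(tables)
    pq.write_table(merged, output_path)
    return fragment_paths  # caller deletes these once the merged file is durable
```

If a pass like this cannot keep up with the fragment creation rate, the file count and query time keep growing until an emergency compaction window is forced.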
Diagram: OLTP database under analytics load, with p99 rising from 20 ms (normal) to 500 ms (while the analytics query runs).
❗ Remember: Columnar write amplification means a single row update might trigger rewriting an entire 100 MB column chunk, multiplied by the number of updated columns.
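The arithmetic behind that warning, using the illustrative numbers from the callout; the logical row size is an assumed figure.

```python
chunk_size_mb = 100         # one column chunk rewritten per updated column
updated_columns = 5         # columns touched by a single-row update
logical_change_bytes = 200  # assumed size of the actual row-level change

physical_rewrite_mb = chunk_size_mb * updated_columns  # 500 MB rewritten
amplification = physical_rewrite_mb * 1024 * 1024 / logical_change_bytes
print(f"{physical_rewrite_mb} MB rewritten, ~{amplification:,.0f}x write amplification")
```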
Sparse Wide Tables:
If most columns are null for most rows, row stores waste space on those nulls or on per-row metadata. Columnar storage can skip entire null segments efficiently, which is a win. However, if access patterns frequently need "all non-null attributes for an entity," reconstructing wide rows from many sparse columns hurts performance.
A concrete case: a user attributes table with 500 optional fields, where the average user has 20 populated. Columnar storage compresses it well, but queries fetching "all attributes for user X" must check 500 column chunks even though 480 are null. Adding a row-oriented cache for this access pattern solved the issue.
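One way such a row-oriented cache might look, as a sketch; the class and its methods are hypothetical, and a production version would more likely live in Redis or another key-value store.

```python
class UserAttributeCache:
    """Serves 'all non-null attributes for user X' from one contiguous record."""

    def __init__(self) -> None:
        self._rows: dict[int, dict[str, object]] = {}

    def put(self, user_id: int, attributes: dict[str, object]) -> None:
        # Keep only the ~20 populated fields; the ~480 nulls are never stored.
        self._rows[user_id] = {k: v for k, v in attributes.items() if v is not None}

    def get(self, user_id: int) -> dict[str, object]:
        # One lookup returns the whole row; no probing of 500 column chunks.
        return self._rows.get(user_id, {})
```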
Consistency Between Stores:
Operationally, consistency between the OLTP row store and the OLAP columnar store is a common failure area. If CDC pipelines lag by tens of minutes during peak traffic, dashboards built on columnar data go stale. For some businesses this is acceptable; for others, like real-time fraud detection, it is a critical correctness issue. One financial services company had a 15-minute CDC lag that allowed fraudulent transactions to complete before they appeared in the monitoring dashboard.
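A hedged sketch of a freshness check that alerts when CDC lag crosses a threshold; the two query callables and the default limit are assumptions for illustration, not the monitoring setup described above.

```python
from datetime import datetime, timedelta
from typing import Callable

def check_cdc_lag(
    oltp_max_commit_ts: Callable[[], datetime],
    olap_max_ingested_ts: Callable[[], datetime],
    max_lag: timedelta = timedelta(minutes=5),
) -> timedelta:
    """Compare the newest committed row in the OLTP store against the newest
    row visible in the columnar store, and alert if the gap is too large."""
    lag = oltp_max_commit_ts() - olap_max_ingested_ts()
    if lag > max_lag:
        # In production this would page on-call or block decisions that need
        # fresh data (e.g. fraud checks), rather than just printing.
        print(f"CDC lag {lag} exceeds {max_lag}; columnar reads are stale")
    return lag
```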
💡 Key Takeaways
✓ Running analytical scans on row-based OLTP databases causes p99 latency to spike from 20 milliseconds to 500 milliseconds due to lock contention and cache eviction
✓ Using columnar stores for point lookups increases latency from tens of milliseconds to seconds because reconstructing one row requires reading many column segments
✓ High update rates (millions per hour) on columnar tables cause fragments to accumulate, degrading query times from 30 seconds to 10 minutes if compaction falls behind
✓ CDC pipeline lag between the row-based OLTP store and the columnar OLAP store creates consistency issues; a 15-minute delay let fraud complete undetected in a real-time detection scenario
✓ Columnar write amplification means updating 5 columns of one row can require rewriting 500 MB of column chunks even though the logical change is tiny
📌 Examples
1. Production incident: a data analyst query scanning a 5 TB user table on a MySQL OLTP cluster caused a 30-minute outage as replication lag hit 2 hours
2. Customer dashboard backed by a Redshift columnar store: p99 latency reached 8 seconds under 1,000 concurrent users doing single-user lookups
3. Data lake with a 10 TB table receiving 2 million updates per hour: query times grew 20x over one week as compaction fell behind the fragment creation rate