Data Storage Formats & Optimization • Row-based vs Columnar Formats
Production Architecture: OLTP and OLAP Together
In real systems handling large-scale traffic, a single database rarely serves every use case. Instead, you typically see a row-based operational store plus a columnar analytics store, connected by data pipelines.
The E-commerce Example:
Imagine a platform handling 50,000 orders per second at peak, with strict requirements: p95 write latency under 10 milliseconds and 99.99 percent availability. The order service writes every order event to a row-based store optimized for random reads and writes, such as a sharded relational database or a key-value store.
Each order operation needs to read or update all fields: user_id, items, shipping_address, payment_status. A row layout means a single page read retrieves the entire record. This keeps latency low and supports the transactional guarantees needed for order processing.
BigQuery as Reference:
BigQuery is designed to scan petabytes in columnar format. Public benchmarks show that scanning 1 TB of compressed columnar data and computing simple aggregates completes in a few seconds when the query touches only a few columns. The engine pushes down predicates, skips blocks, and uses vectorized execution. The same query on a row-based OLTP database would either time out or saturate the cluster.
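As a rough sketch of what such an aggregate query looks like through the BigQuery Python client (the analytics.orders table and its column names are invented here purely for illustration, not taken from the example above):

```python
from google.cloud import bigquery

# Hypothetical table and columns: the query references 4 columns out of a
# wide order schema, so a columnar engine reads only those column blocks.
QUERY = """
SELECT country, device_type, campaign, SUM(revenue) AS total_revenue
FROM `analytics.orders`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY country, device_type, campaign
"""

def revenue_by_segment() -> list[dict]:
    client = bigquery.Client()           # uses default GCP credentials
    rows = client.query(QUERY).result()  # blocks until the scan finishes
    return [dict(row.items()) for row in rows]
```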
The Analytics Side:
In parallel, a Change Data Capture (CDC) pipeline streams order events into a data lake in a columnar file format such as Parquet. The analytics workload has completely different characteristics: product teams run queries like "for the last 30 days, group total revenue by country, device type, and campaign."
These queries scan tens or hundreds of billions of rows but only need 5 to 10 columns out of maybe 200 total. With columnar storage, the query engine reads only those 5 to 10 columns. If each column compresses at 5x to 10x, a query that would require hundreds of gigabytes of I/O on a row store might need only tens of gigabytes.
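A minimal sketch of that column pruning at the file level with pyarrow; the file path and column names below are assumptions for illustration:

```python
import pyarrow.parquet as pq

# Only the columns the query references are decoded and read off disk;
# the other ~200 columns of the order schema are skipped entirely.
COLUMNS = ["order_date", "country", "device_type", "campaign", "revenue"]

table = pq.read_table("orders/2024-06.parquet", columns=COLUMNS)

# A simple aggregate over the pruned columns.
per_segment = table.group_by(["country", "device_type"]).aggregate([("revenue", "sum")])

# Row-group metadata also shows how much data the scan could skip.
meta = pq.ParquetFile("orders/2024-06.parquet").metadata
print(meta.num_rows, meta.num_row_groups)
```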
Query Performance Comparison: the same 30-day aggregate times out on the row-based OLTP store but completes in roughly 3 seconds on the columnar OLAP store.
✓ In Practice: Operational writes go to the row store with millisecond latency; near-real-time replication flows into columnar files; analytics engines scan those files with high throughput but higher per-query latency, measured in seconds rather than milliseconds.
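A toy sketch of that flow, using sqlite3 as a stand-in for the row store and pyarrow for the columnar batch files; all table, column, and file names are invented for illustration:

```python
import sqlite3
import pyarrow as pa
import pyarrow.parquet as pq

# Operational side: full-record writes against a row store (sqlite3 as a toy stand-in).
db = sqlite3.connect("orders.db")
db.execute("""CREATE TABLE IF NOT EXISTS orders (
    order_id TEXT PRIMARY KEY, user_id TEXT, items TEXT,
    shipping_address TEXT, payment_status TEXT, country TEXT, revenue REAL)""")

def write_order(order: tuple) -> None:
    # One statement touches every field of the record: the OLTP access pattern.
    db.execute("INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?, ?, ?, ?)", order)
    db.commit()

# Replication side: periodic batch export of selected columns into a columnar file.
def export_batch(path: str) -> None:
    names = ["order_id", "country", "revenue"]
    rows = db.execute("SELECT order_id, country, revenue FROM orders").fetchall()
    columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}
    pq.write_table(pa.table(columns), path)

write_order(("o-1", "u-42", '["sku-9"]', "1 Example St", "PAID", "US", 19.99))
export_batch("orders_batch.parquet")
```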
💡 Key Takeaways
✓ Production systems separate OLTP row stores, handling 50,000 writes per second with p95 under 10 milliseconds, from OLAP columnar stores
✓ CDC pipelines stream operational data into columnar data lakes, typically with minutes of replication lag, which is acceptable for analytics
✓ Columnar analytics queries scanning 1 TB complete in seconds by reading only the 5 to 10 referenced columns at 5x to 10x compression
✓ Row stores optimize for the full-record access needed by transactional APIs, while columnar stores optimize for aggregate queries over billions of rows
✓ BigQuery scans compressed columnar data with predicate pushdown and vectorized execution, achieving throughput that row-based OLTP systems cannot match
📌 Examples
1. E-commerce platform: MySQL cluster handles order writes at 50k/sec, Kafka streams changes to S3 Parquet files, Spark queries 30 days of orders in 5 seconds (sketched below)
2. User analytics: DynamoDB serves profile lookups under 10 milliseconds, hourly export to columnar format enables cohort analysis across 500 million users
3. Financial reporting: PostgreSQL processes transactions with ACID guarantees, nightly ETL loads into a Redshift columnar warehouse for regulatory reports
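For the first example, a minimal PySpark sketch of the 30-day revenue query over the Parquet files; the S3 prefix and column names are assumptions, not details given above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-30d-revenue").getOrCreate()

# Hypothetical layout: CDC output lands as Parquet files under this prefix.
orders = spark.read.parquet("s3a://analytics-lake/orders/")

revenue = (
    orders
    .filter(F.col("order_date") >= F.date_sub(F.current_date(), 30))
    .groupBy("country", "device_type", "campaign")
    .agg(F.sum("revenue").alias("total_revenue"))
)
revenue.show()
```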