Data Storage Formats & Optimization • Row-based vs Columnar FormatsEasy⏱️ ~2 min
What is Row vs Columnar Storage?
Definition
Storage format refers to how database systems physically organize data on disk. The two fundamental approaches are row based (storing complete records together) and columnar (storing values from the same field together).
user_id, name, email, signup_ts, followed by Row 2 with the same structure. If your workload usually needs "the full user" or "the full order," this layout is ideal. You pay one disk seek and get the entire record. Traditional relational databases like MySQL and PostgreSQL use row based storage because they optimize for transactional workloads.
Columnar Storage:
Flips the layout completely. It stores values column by column. One block contains only user_id values for thousands of rows, another block only country, another last_login. Analytical queries usually look like "compute an aggregate over a few columns across many rows." Columnar layouts minimize I/O because the engine reads only referenced columns. Systems like BigQuery, Snowflake, and file formats like Parquet and ORC use columnar storage.
"The format choice connects workload characteristics to storage layout, not just database brand names."
💡 Key Takeaways
✓Row based storage keeps all fields of a record together physically, optimized for reading or writing entire records with single disk access
✓Columnar storage groups values by column across many records, optimized for scanning few columns across millions of rows
✓OLTP workloads with point queries under 20 milliseconds latency need row based format for efficient single record access
✓Analytics workloads scanning billions of rows but only 5 to 10 columns benefit from columnar format which reads only needed columns
✓Columnar formats achieve 5x to 10x compression because similar data types stored together compress better than mixed record data
📌 Examples
1MySQL storing user profiles row based: fetching one user reads all fields (name, email, address) in a single page access
2BigQuery storing event logs columnar: query "sum revenue by country" reads only revenue and country columns, skipping hundreds of other fields
3Parquet file with 1 billion events and 200 columns: analytics query touching 5 columns reads 2 GB instead of 400 GB with row format