Database Design · Column-Oriented Databases (Redshift, BigQuery) · Hard · ⏱️ ~3 min

Write Patterns and Compaction: Why Columnar Stores Excel at Append but Struggle with Updates

Columnar segments are immutable for good reason. Once written, each column's values are tightly compressed (dictionary encoded, run-length packed, bit packed) and indexed with min/max metadata for segment skipping. Any modification requires decompressing, editing, recompressing, and rewriting entire segments, which makes random single-row updates prohibitively expensive: a 1 KB row update might trigger rewriting a 100 MB segment, amplifying the write by orders of magnitude.

The solution is an append-only architecture with periodic compaction. New data lands in small mutable delta files or separate append partitions. Reads merge base segments with deltas at query time, which adds latency. Background compaction jobs periodically rewrite base plus deltas into new optimized segments, then atomically swap metadata to point queries at the fresh data. Deletion works similarly, via tombstone markers that filter results during reads until compaction physically removes the rows.

This pattern works beautifully for bulk ingestion. Batch loads of millions of rows every hour write large, well-compressed segments (around 1 GB each) that maximize scan throughput. Micro-batches every 5 minutes streaming Change Data Capture (CDC) from transactional databases work too, provided they are buffered to produce roughly 128 MB segments. The breaking point is high-frequency updates, like incrementing counters or editing individual records hundreds of times per day: each edit creates a new delta fragment, reads slow down as they merge dozens of deltas, and compaction falls behind.

Uber's AresDB illustrates the extreme. It is a real-time columnar store ingesting millions of events per second with sub-second query latency. The trick is accepting eventual consistency during a short window, buffering micro-batches in memory before flushing them to immutable segments, and running aggressive compaction on fresh data to keep read amplification low. Traditional column warehouses target minutes to hours of ingestion lag, making them unsuitable for operational use cases that need data fresh within seconds.
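A minimal Python sketch may make the delta-plus-compaction pattern concrete. Every name here (Segment, DeltaStore, read, compact) is illustrative rather than the API of any real warehouse, and the logic is deliberately simplified: tombstones mark deletes, delta rows supersede base rows, and compaction folds both back into a fresh immutable segment.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Segment:
    # Immutable base segment: in a real store this is compressed column
    # chunks plus min/max metadata for segment skipping.
    rows: tuple  # tuple of (row_id, payload) pairs, frozen once written

@dataclass
class DeltaStore:
    # Mutable side buffer: new rows and tombstones land here, never in
    # the base segments.
    inserts: list = field(default_factory=list)   # [(row_id, payload), ...]
    tombstones: set = field(default_factory=set)  # row_ids marked deleted

def read(base_segments, delta):
    # Merge-on-read: base rows are filtered by tombstones and overlaid with
    # delta rows (new rows or updated versions). This merge is the read
    # latency cost that grows as deltas pile up.
    fresh = dict(delta.inserts)  # row_id -> latest payload
    for seg in base_segments:
        for row_id, payload in seg.rows:
            if row_id in delta.tombstones or row_id in fresh:
                continue  # deleted, or superseded by a newer delta version
            yield row_id, payload
    for row_id, payload in fresh.items():
        if row_id not in delta.tombstones:
            yield row_id, payload

def compact(base_segments, delta):
    # Background compaction: rewrite base plus deltas into one fresh
    # immutable segment and start over with an empty delta store. A real
    # system swaps metadata atomically so readers never see a half-compacted
    # state.
    merged = tuple(sorted(read(base_segments, delta)))
    return [Segment(rows=merged)], DeltaStore()

# Usage: an "update" is a new version in the delta, a "delete" is a tombstone;
# neither touches the compressed base segment until compaction runs.
base = [Segment(rows=((1, "a"), (2, "b"), (3, "c")))]
delta = DeltaStore()
delta.inserts.append((2, "b-v2"))   # update row 2 without rewriting its segment
delta.tombstones.add(3)             # delete row 3 via tombstone
assert sorted(read(base, delta)) == [(1, "a"), (2, "b-v2")]
base, delta = compact(base, delta)  # folds deltas in, drops tombstoned rows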
💡 Key Takeaways
Immutable segments enable aggressive compression and indexing. Dictionary encoding and run-length encoding on columnar data achieve 3 to 10x size reduction, but any edit requires rewriting the entire segment.
Write amplification punishes single-row updates. Modifying a 1 KB row in a 100 MB segment triggers a 100 MB rewrite, amplifying the write by orders of magnitude. Updates routed through delta files defer this cost but accumulate read overhead.
Bulk ingestion is the optimal pattern. Batch loading millions of rows every hour produces large 1 GB segments that maximize compression and scan throughput. Micro-batches every 5 minutes work if buffered into roughly 128 MB segments (see the buffering sketch after this list).
Compaction reclaims performance and storage. Background jobs merge base segments with accumulated deltas, rewriting them into fresh optimized segments. Without compaction, reads slow down as they merge dozens of deltas.
High-frequency updates break the model. A workload that updates the same rows hundreds of times per day creates massive delta accumulation: compaction falls behind, queries degrade, and storage bloats with duplicate versions.
Eventual consistency trades freshness for throughput. Real-time systems like Uber's AresDB buffer micro-batches in memory before flushing, accepting seconds of lag to maintain columnar benefits at millions of events per second.
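The micro-batch takeaways above come down to a simple buffering rule, sketched below: accumulate rows in memory and flush an immutable segment once the buffer reaches a target size or a maximum age, accepting that the buffered rows are the eventual-consistency window. The class name and thresholds are illustrative only; the numbers are the ones quoted in this section, not recommendations.

import time

TARGET_SEGMENT_BYTES = 128 * 1024 * 1024   # aim for ~128 MB segments per flush
MAX_BUFFER_AGE_SECONDS = 300               # flush at least every 5 minutes

class MicroBatchBuffer:
    def __init__(self, flush_segment):
        self._flush_segment = flush_segment  # callable that writes one immutable segment
        self._rows = []
        self._bytes = 0
        self._opened_at = time.monotonic()

    def append(self, row_bytes):
        # Buffer one encoded row; flush when the size or age threshold is hit.
        self._rows.append(row_bytes)
        self._bytes += len(row_bytes)
        too_big = self._bytes >= TARGET_SEGMENT_BYTES
        too_old = time.monotonic() - self._opened_at >= MAX_BUFFER_AGE_SECONDS
        if too_big or too_old:
            self.flush()

    def flush(self):
        # Write the buffered rows as one immutable segment and reset. Rows
        # sitting in this buffer are the eventual-consistency window: queries
        # see them only after the segment is written.
        if self._rows:
            self._flush_segment(self._rows)
        self._rows, self._bytes = [], 0
        self._opened_at = time.monotonic()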
📌 Examples
Netflix data lake ingestion pattern. Batch jobs run every hour, loading millions of events into new daily partitions as Parquet files. Each file is about 1 GB compressed, optimized for compression and scan throughput. There are no updates or deletes: the pattern is purely append-only, which eliminates compaction overhead.
E-commerce order updates anti-pattern. The transaction system updates order status 5 to 10 times per order (pending, processing, shipped, delivered). Syncing this to a columnar warehouse creates 5 to 10 deltas per row; with 1 million orders per day, that is 5 to 10 million deltas daily. Compaction runs continuously, consuming resources and lagging behind. Solution: denormalize into an append-only event stream (an order_events table) rather than updating an order_status column, as in the sketch below.
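A sketch of that fix, with hypothetical table and field names: each status change becomes a new append-only row destined for an order_events stream, and the current status is derived at read time instead of being maintained by in-place updates.

from datetime import datetime, timezone

def order_status_event(order_id, new_status):
    # Build an append-only event row; the warehouse only ever receives inserts.
    return {
        "order_id": order_id,
        "status": new_status,  # pending, processing, shipped, delivered, ...
        "event_time": datetime.now(timezone.utc).isoformat(),
    }

def latest_status(events):
    # Derive current status per order as the most recent event, the query-time
    # equivalent of the order_status column that is no longer updated in place.
    latest = {}
    for e in sorted(events, key=lambda e: e["event_time"]):
        latest[e["order_id"]] = e["status"]
    return latest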