TSDB Write Path: WAL, In-Memory Buffers, and Compaction

The write path in a time series database (TSDB) is designed to absorb bursty, high-velocity ingestion while preserving durability and enabling efficient queries. The architecture follows an append-first pattern with three layers: a write-ahead log (WAL) for durability, in-memory buffers for fast writes and out-of-order handling, and background compaction for query optimization and compression.

When a sample arrives, it is immediately appended to the WAL (a sequential, disk-based log) and then inserted into an in-memory buffer that accepts out-of-order writes within a bounded reorder window (typically seconds to minutes). This window matters because network jitter and batching cause samples to arrive late. For example, Uber's M3 platform ingests tens of millions of metric samples per second while handling arrivals that are minutes out of order. The in-memory buffer is organized by series identifier (a hash of the measurement plus its tags) and maintains a sorted structure per series.
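
The append-then-buffer sequence can be sketched roughly as follows. This is a minimal illustration, not any particular database's implementation: the `WritePath` class, the JSON-lines WAL format, the 120-second default window, and the choice to reject samples older than the window are all assumptions made for the example.

```python
import bisect
import hashlib
import json
import os
from collections import defaultdict


class WritePath:
    """Toy append-first write path: WAL append, then sorted in-memory insert."""

    def __init__(self, wal_path, reorder_window_s=120):
        self.wal = open(wal_path, "a", encoding="utf-8")
        self.reorder_window_s = reorder_window_s  # assumed value, tune per workload
        self.buffers = defaultdict(list)          # series_id -> sorted [(ts, value)]
        self.max_ts = {}                          # series_id -> newest timestamp seen

    @staticmethod
    def series_id(measurement, tags):
        # Hash of measurement plus tags; tags are sorted so key order is irrelevant.
        canonical = measurement + "," + ",".join(
            f"{k}={v}" for k, v in sorted(tags.items())
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

    def write(self, measurement, tags, ts, value):
        sid = self.series_id(measurement, tags)
        newest = self.max_ts.get(sid, ts)
        if ts < newest - self.reorder_window_s:
            return False  # older than the reorder window: rejected in this sketch

        # 1. Durability first: sequential append to the WAL, forced to disk.
        self.wal.write(json.dumps({"sid": sid, "ts": ts, "v": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())

        # 2. Insert into the per-series buffer, keeping it sorted by timestamp.
        bisect.insort(self.buffers[sid], (ts, value))
        self.max_ts[sid] = max(newest, ts)
        return True

    def close(self):
        self.wal.close()
```

Calling `fsync` on every sample keeps the sketch simple; batching several appends per sync is a common way to trade a little durability latency for throughput.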

Background processes periodically flush these buffers to immutable segments on disk, applying aggressive compression during the conversion. Time-Structured Merge (TSM) formats, similar to Log-Structured Merge (LSM) trees, report a 45x space improvement over traditional B-tree layouts by exploiting time locality and columnar encoding. Delta-of-delta encoding stores the change between successive timestamp deltas rather than the timestamps themselves: for timestamps 100, 110, 120, the encoder stores the first delta, 10, and then 0, meaning the delta did not change. XOR encoding for floats exploits the fact that consecutive metric values often share most of their bits, so the XOR of neighboring values is mostly zeros and compresses well.
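
Both encodings can be illustrated with a short sketch. The function names here are made up for the example, and the code stops at producing the small integers and XOR words; a production encoder would go on to bit-pack them, which is omitted.

```python
import struct


def delta_of_delta_encode(timestamps):
    """Return [first_ts, first_delta, dod, dod, ...] for integer timestamps."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev_delta = None
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        # The first delta is stored as-is; later ones as the change in delta.
        out.append(delta if prev_delta is None else delta - prev_delta)
        prev_delta = delta
    return out


def delta_of_delta_decode(encoded):
    """Inverse of delta_of_delta_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for i, x in enumerate(encoded[1:]):
        delta = x if i == 0 else delta + x
        timestamps.append(timestamps[-1] + delta)
    return timestamps


def float_bits(x):
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_encode(values):
    """Return the first value's bits, then each value XORed with its predecessor."""
    bits = [float_bits(v) for v in values]
    return bits[:1] + [a ^ b for a, b in zip(bits, bits[1:])]


print(delta_of_delta_encode([100, 110, 120]))   # [100, 10, 0]
# Identical neighbors XOR to 0; close neighbors differ in only a few bits.
print([f"{w:064b}".count("1") for w in xor_encode([21.5, 21.5, 21.75])[1:]])
```

With a constant scrape interval, every value after the first delta is 0, which is why regularly sampled series compress so well under this scheme.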

The compaction process merges overlapping segments, deduplicates records by series identifier and timestamp using last-write-wins semantics (the record with the highest sequence number is kept), and creates larger consolidated segments. This trades write amplification (the same data is rewritten several times across merges) for read efficiency and space savings. Systems manage this tradeoff by staggering compactions across time ranges and capping segment sizes to bound worst-case merge costs.
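
The merge and deduplication step can be sketched as a k-way merge over sorted segments. The `Record` layout, the `compact` function, and the record-count cap are hypothetical choices for illustration; the sketch also assumes each input segment is already sorted by series identifier, timestamp, and sequence number.

```python
import heapq
from typing import NamedTuple


class Record(NamedTuple):
    series_id: str
    ts: int
    seq: int      # write sequence number; higher means newer
    value: float


def compact(segments, max_segment_records=4):
    """Merge sorted segments, keep the highest-seq record per (series_id, ts),
    and split the output into segments of at most max_segment_records."""
    merged = heapq.merge(*segments, key=lambda r: (r.series_id, r.ts, r.seq))
    out, current, last = [], [], None
    for rec in merged:
        if last is not None and (rec.series_id, rec.ts) == (last.series_id, last.ts):
            # Duplicate key: merge order puts the higher seq later, so it wins.
            current[-1] = rec
        else:
            if len(current) >= max_segment_records:
                out.append(current)   # cap segment size to bound future merges
                current = []
            current.append(rec)
        last = rec
    if current:
        out.append(current)
    return out


seg_a = [Record("cpu", 100, 1, 0.5), Record("cpu", 110, 2, 0.6)]
seg_b = [Record("cpu", 110, 5, 0.9), Record("cpu", 120, 6, 0.7)]
for segment in compact([seg_a, seg_b], max_segment_records=2):
    print([(r.ts, r.value) for r in segment])
```

Here the retried write at timestamp 110 (sequence 5) replaces the earlier one (sequence 2), and the cap of two records forces the output into two segments.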

💡 Key Takeaways

• Write-ahead log (WAL) provides durability through sequential disk appends before in-memory insertion

• In-memory reorder window accepts out-of-order arrivals within a bounded delay (typically seconds to a few minutes), so late samples are sorted into place before the buffer is flushed

• Deduplication uses series identifier plus timestamp as a composite key, with sequence numbers providing last-write-wins semantics during retries

• Time-Structured Merge (TSM) format achieves a 45x space improvement over B-tree layouts by exploiting time locality and columnar encoding

• Compression techniques: delta-of-delta for timestamps (store 10, then 0 if the interval is constant), XOR for floats (consecutive values share most bits)

• Compaction tradeoff: write amplification (data written multiple times) versus read efficiency and space savings, managed by staggering merges and capping segment sizes

📌 Examples

Uber M3 platform: ingests tens of millions of metric samples per second with minutes of out-of-order tolerance using in-memory reorder buffers

InfluxDB TSM engine: reports 10 to 100x compression when persisting to columnar object storage with background compaction and deduplication

Delta-of-delta encoding example: timestamps 1000, 1010, 1020, 1030 are stored as 1000 (base), 10 (first delta), then 0, 0 (delta of delta), replacing large absolute timestamps with small values and saving significant space
