
Compression Techniques: How Time Series Databases Achieve 10 to 100x Storage Reduction

Time series data compresses extraordinarily well because adjacent values are highly correlated: timestamps increase monotonically, temperatures change gradually, and status codes repeat frequently. Time series databases (TSDBs) exploit this with specialized encodings, achieving 8 to 59x compression versus naive row storage and 10 to 100x versus raw text or JSON, which directly reduces storage cost and improves query performance.

Timestamps use delta-of-delta encoding. Instead of storing absolute Unix timestamps like 1609459200, 1609459201, 1609459202, you store the first value, then the delta (1 second), then the deltas of deltas (0, 0, 0). For regular-interval data this collapses to nearly zero bits after the first few bytes. Gorilla compression from Facebook extends this with variable-length encoding of the deltas of deltas, allocating fewer bits when values are predictable.

Floating-point values use Gorilla XOR compression. Consecutive temperature readings like 72.3, 72.4, 72.3 share many bits in their IEEE 754 representation, so XORing consecutive values produces words that are mostly zeros, which compress efficiently with leading- and trailing-zero suppression. Benchmarks show 10 to 20x reduction on typical sensor data.

Integers use frame-of-reference encoding: store the minimum value of a block once, then encode each value as an offset in the minimum number of bits required. If all values in a 1,000-point block fall between 1000 and 1100, you store 1000 once and spend 7 bits per offset.

String and categorical data use dictionary and run-length encoding. Status codes, region names, and device types have low cardinality, so a dictionary maps each distinct string to a small integer ID once per segment, and the column stores IDs instead of repeated strings. Run-length encoding then compresses sequences like OK OK OK OK ERROR OK OK into (OK, 4), (ERROR, 1), (OK, 2).

Modern systems apply these encodings automatically in columnar storage: InfluxDB 3.x persists data as Parquet files, and TimescaleDB converts chunks to its own compressed columnar layout inside PostgreSQL. Columnar layouts also enable SIMD-vectorized queries that process thousands of values per CPU instruction. The result is that compressed columnar data on cheap object storage often queries faster than uncompressed row storage on SSD, because far less I/O is required to scan it.
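A minimal Python sketch of delta-of-delta encoding, illustrating the idea rather than any engine's actual implementation; a real encoder like Gorilla would pack the deltas of deltas into variable-width bit fields instead of a Python list:

```python
def delta_of_delta_encode(timestamps):
    """Encode sorted timestamps as (first value, first delta, deltas of deltas)."""
    first = timestamps[0]
    first_delta = timestamps[1] - first
    dods, prev_delta = [], first_delta
    for prev, curr in zip(timestamps[1:], timestamps[2:]):
        delta = curr - prev
        dods.append(delta - prev_delta)  # 0 for perfectly regular intervals
        prev_delta = delta
    return first, first_delta, dods

# Regular 1-second samples collapse to all-zero deltas of deltas,
# which a bit-level encoder can store in a single bit each.
print(delta_of_delta_encode([1609459200, 1609459201, 1609459202, 1609459203]))
# (1609459200, 1, [0, 0])
```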
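And a sketch of the XOR step on raw IEEE 754 bits; the follow-up step from the Gorilla paper, counting leading and trailing zeros and storing only the meaningful middle bits, is omitted here:

```python
import struct

def float_bits(x: float) -> int:
    """Reinterpret a double as its 64-bit IEEE 754 pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    """XOR each value's bits with its predecessor's; similar readings
    produce words dominated by zero bits."""
    prev = float_bits(values[0])
    yield prev
    for v in values[1:]:
        bits = float_bits(v)
        yield bits ^ prev
        prev = bits

for word in xor_stream([72.3, 72.4, 72.3]):
    print(f"{word:064b}")  # after the first value, long runs of leading zeros
```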
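Frame-of-reference encoding fits in a few lines, shown here with the 1000 to 1100 example from the text; a real column store would bit-pack the offsets rather than keep them as Python ints:

```python
def frame_of_reference_encode(block):
    """Store the block minimum once, then each value as an offset
    encoded in the fewest bits that fit the largest offset."""
    base = min(block)
    offsets = [v - base for v in block]
    width = max(offsets).bit_length() or 1  # bits needed per offset
    return base, width, offsets

base, width, offsets = frame_of_reference_encode([1000, 1042, 1100, 1017])
print(base, width, offsets)  # 1000 7 [0, 42, 100, 17]
```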
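Dictionary and run-length encoding compose naturally; this sketch reproduces the OK/ERROR example above, and segment-level dictionaries in real engines work the same way at much larger scale:

```python
def dictionary_encode(column):
    """Map each distinct string to a small integer ID (stored once per
    segment) and rewrite the column as IDs."""
    mapping, ids = {}, []
    for value in column:
        ids.append(mapping.setdefault(value, len(mapping)))
    return mapping, ids

def run_length_encode(ids):
    """Collapse runs of repeated IDs into (id, run length) pairs."""
    runs = []
    for v in ids:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(run) for run in runs]

statuses = ["OK", "OK", "OK", "OK", "ERROR", "OK", "OK"]
mapping, ids = dictionary_encode(statuses)
print(mapping)                 # {'OK': 0, 'ERROR': 1}
print(run_length_encode(ids))  # [(0, 4), (1, 1), (0, 2)]
```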
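To see these encodings applied automatically, here is a rough experiment with pyarrow (assuming pyarrow is installed; the file name, column names, and data are illustrative, not taken from the benchmarks cited below):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({
    "time": pa.array(range(1_609_459_200, 1_609_459_200 + n)),  # regular timestamps
    "status": pa.array(["OK"] * (n - 1) + ["ERROR"]),           # low cardinality
})

# Parquet applies dictionary, run-length, and bit-packing encodings per column.
pq.write_table(table, "metrics.parquet", compression="zstd")
print(os.path.getsize("metrics.parquet"), "bytes on disk")

# Reading back only the needed column scans a fraction of the file.
subset = pq.read_table("metrics.parquet", columns=["time"])
```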
💡 Key Takeaways
Delta-of-delta encoding compresses regular-interval timestamps to nearly zero bits after the initial value by storing the difference between consecutive deltas, with Gorilla's variable-length encoding allocating fewer bits for predictable patterns
Gorilla XOR compression for floats exploits the IEEE 754 bit representation: consecutive values such as temperature readings share many bits, so XORing them produces mostly zeros that compress well under leading- and trailing-zero suppression, achieving 10 to 20x reduction
Dictionary encoding maps low-cardinality strings like status codes or region names to small integer IDs stored once per segment, then references the IDs instead of repeating full strings
Real-world compression achieves 8 to 59x smaller storage versus naive row formats depending on cardinality: 100 devices with 1 metric compress 59x, while 4,000 devices with 10 metrics compress 8x
Columnar storage applies these encodings automatically (InfluxDB 3.x persists Parquet files; TimescaleDB uses its own compressed columnar chunk layout) and enables SIMD-vectorized queries that process thousands of values per CPU instruction, making compressed data on cheap object storage query faster than uncompressed data on SSD due to lower I/O
📌 Examples
Facebook's Gorilla in-memory TSDB compresses 2 billion data points per machine using delta-of-delta timestamps and XOR float encoding, achieving 1.37 bytes per point on average versus 16 bytes uncompressed, a roughly 12x reduction.
InfluxDB 3.x persists data as Parquet files on S3. A test with 1 million metric points totaling 20 MB raw compressed to 500 KB (40x) using combined delta, dictionary, and run-length encodings, with queries reading only the relevant columns.
TimescaleDB compression policies automatically convert hot, row-oriented chunks to a compressed columnar format after 7 days. A monitoring system saw storage drop from 10 TB to 1 TB while query performance improved because range scans required less I/O.