TSDB Storage Tiering and Retention Lifecycle

Time series databases (TSDBs) manage the data lifecycle through storage tiering, which balances query latency, storage cost, and retention duration. The pattern is to keep recent, high-resolution data in fast, expensive storage (memory and solid-state drives) and progressively migrate older data to slower, cheaper storage (object stores such as Amazon Simple Storage Service, S3) while downsampling it to reduce volume.

A typical three-tier architecture places the last few hours to days in hot storage (memory or local SSD), targeting p50 query latencies in the double-digit millisecond range for dashboard queries. This tier handles the majority of query traffic because operational monitoring focuses on recent windows. Warm storage (weeks to months) uses attached SSDs or network storage, with higher latency (100 to 200 milliseconds) but lower cost per gigabyte. Cold storage (months to years) uses object stores, with latencies of seconds but storage that is 10 to 50x cheaper than SSD and effectively unlimited in capacity.
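The tier boundaries above imply that a query can be split by data age, with each sub-range sent to the tier that holds it. The following is a minimal sketch of that idea, not any particular database's router: the `Tier` class, the `route_query` function, and the 1-day, 30-day, and 365-day boundaries are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Tier:
    name: str
    max_age: timedelta  # data older than this lives in the next tier


# Hypothetical boundaries; real deployments tune these per workload.
TIERS = [
    Tier("hot", timedelta(days=1)),
    Tier("warm", timedelta(days=30)),
    Tier("cold", timedelta(days=365)),
]


def route_query(start, end, now):
    """Split the range [start, end) into per-tier sub-ranges, newest first."""
    plan = []
    newer_edge = now  # newest timestamp the current tier can hold
    for tier in TIERS:
        older_edge = now - tier.max_age
        lo = max(start, older_edge)
        hi = min(end, newer_edge)
        if lo < hi:  # the query overlaps this tier's age window
            plan.append((tier.name, lo, hi))
        newer_edge = older_edge
    return plan


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    # A 3-day dashboard query touches the hot tier and part of the warm tier.
    for name, lo, hi in route_query(now - timedelta(days=3), now, now):
        print(name, lo.isoformat(), hi.isoformat())
```

Data older than the last tier's `max_age` is treated as expired here, so a query that falls entirely beyond it yields an empty plan.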

Downsampling happens during tier transitions to control storage growth. For example, raw data collected every 10 seconds is rolled up to 1-minute averages after 7 days, then to 5-minute averages after 30 days, and finally to 1-hour summaries for long-term retention. Relative to the raw data, this reduces the number of stored points by 6x, 30x, and 360x respectively, while preserving the ability to answer coarse-grained historical questions. InfluxDB reports 10 to 100x compression when persisting to columnar object storage formats such as Parquet, combining downsampling with encoding optimizations.
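As a rough illustration of the rollup step, the sketch below averages raw points into fixed-width time buckets and computes the reduction factor as the ratio of rollup interval to raw interval. The function names `downsample_mean` and `reduction_factor` are hypothetical, and points are assumed to be `(unix_seconds, value)` pairs.

```python
from collections import defaultdict


def downsample_mean(points, bucket_seconds):
    """Average (unix_seconds, value) points into fixed-width time buckets."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)  # align to bucket boundary
        sums[bucket_start] += value
        counts[bucket_start] += 1
    return [(b, sums[b] / counts[b]) for b in sorted(sums)]


def reduction_factor(raw_interval_s, rollup_interval_s):
    """How many raw points collapse into one rolled-up point."""
    return rollup_interval_s // raw_interval_s


if __name__ == "__main__":
    # Two minutes of 10-second samples with values 0.0 through 11.0.
    raw = [(t, float(t // 10)) for t in range(0, 120, 10)]
    print(downsample_mean(raw, 60))
    for rollup in (60, 300, 3600):
        print(rollup, reduction_factor(10, rollup))
```

This sketch keeps only the mean. A rollup that also stores the minimum, maximum, sum, and count per bucket can answer more query types and lets later rollups be computed from earlier ones without averaging averages.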

Retention policies automate Time To Live (TTL) deletions and tier transitions. The key is that time-partitioned storage enables atomic drops of entire segments, avoiding the scattered deletes and tombstones that cause read amplification. A day partition that reaches its TTL is simply removed from metadata, and its storage is reclaimed asynchronously. Production systems like Uber's M3 maintain months of retention in distributed stores with automatic downsampling and tiering, processing tens of millions of samples per second while keeping query costs predictable as data ages.
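The partition-drop idea can be sketched with a toy metadata index. Everything here is hypothetical (the `PartitionIndex` class, its method names, and the segment paths); it only illustrates that expiry is a metadata update per day partition, with the actual reclamation deferred to a background step.

```python
from datetime import date, timedelta


class PartitionIndex:
    """Toy metadata index mapping each day to its segment path."""

    def __init__(self):
        self.partitions = {}      # date -> segment path, visible to queries
        self.pending_delete = []  # segment paths awaiting async reclamation

    def add(self, day, path):
        self.partitions[day] = path

    def apply_retention(self, today, ttl_days):
        """Unlink every partition older than the TTL; return the dropped days."""
        cutoff = today - timedelta(days=ttl_days)
        expired = sorted(d for d in self.partitions if d < cutoff)
        for day in expired:
            # One dictionary removal hides the whole day from queries;
            # no per-row deletes or tombstones are written.
            self.pending_delete.append(self.partitions.pop(day))
        return expired


if __name__ == "__main__":
    index = PartitionIndex()
    for offset in range(5):
        day = date(2024, 1, 1) + timedelta(days=offset)
        index.add(day, f"segments/{day.isoformat()}")
    print(index.apply_retention(today=date(2024, 1, 5), ttl_days=2))
    print(index.pending_delete)
```

A background worker would later delete the paths in `pending_delete`; that part is omitted here.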

💡 Key Takeaways

•Three-tier architecture: hot storage (memory or local SSD) for the last few hours to days (for example, the last 24 hours) at 10 to 50 ms p50 latency, warm storage (attached SSD or network storage) for weeks to months at 100 to 200 ms, cold storage (object stores) for months to years at latencies of seconds

•Downsampling during tier transitions: raw 10-second data rolled up to 1-minute averages (6x reduction) after 7 days, 5-minute averages (30x reduction) after 30 days, and 1-hour summaries (360x reduction) for long-term retention

•Cost tradeoffs: SSD storage at $0.10 per GB per month versus object storage at $0.002 per GB per month (50x cheaper), with the higher query latency being acceptable for historical analysis

•InfluxDB reports 10 to 100x compression when persisting to columnar object storage formats such as Parquet, combining downsampling with delta-of-delta and XOR encoding

•Time partitioned storage enables atomic drops for TTL: entire day partition removed from metadata without scattered deletes and tombstones that cause read amplification

•Query routing: recent queries hit hot tier for fast dashboards, historical queries automatically routed to appropriate tier based on time range with resolution matched to tier

📌 Examples

Uber M3 platform: months of retention in distributed stores with automatic downsampling and tiering, processing tens of millions of samples per second, with minute-level resolution for older data

Typical lifecycle: application metrics at 10-second resolution for 7 days on SSD (60,480 points per series, or 8,640 per day), downsampled to 1-minute resolution for 30 days (43,200 points), then to 1-hour resolution for 1 year (8,760 points), with final storage in S3

Cost example: 1 TB of hot storage holding the last 24 hours at $100/month, 10 TB of warm storage holding 30 days at $500/month, and 100 TB of cold storage holding 1 year at $200/month; 10x downsampling at each tier transition keeps volume from growing in proportion to retention

TTL automation: retention policy automatically drops partitions older than 90 days, entire day segments removed atomically at midnight without impacting active queries or requiring compaction
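The arithmetic behind the lifecycle and cost examples can be checked with a short calculation. Points per series are retention time divided by sample interval, and monthly cost is tier size times price per GB. The per-GB prices below are derived from the cost example (dollars divided by size), assuming 1 TB = 1,000 GB; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def points_per_series(interval_s, retention_days):
    """Number of samples one series holds at a given resolution and retention."""
    return retention_days * SECONDS_PER_DAY // interval_s


def monthly_cost(size_gb, price_per_gb_month):
    """Monthly storage cost in dollars for one tier."""
    return size_gb * price_per_gb_month


# (label, sample interval in seconds, retention in days)
LIFECYCLE = [("raw 10 s", 10, 7), ("1 min", 60, 30), ("1 h", 3600, 365)]

# (tier, size in GB, assumed price in dollars per GB per month)
TIER_COSTS = [("hot", 1_000, 0.10), ("warm", 10_000, 0.05), ("cold", 100_000, 0.002)]

if __name__ == "__main__":
    for label, interval_s, days in LIFECYCLE:
        print(label, points_per_series(interval_s, days))
    for tier, size_gb, price in TIER_COSTS:
        print(tier, round(monthly_cost(size_gb, price), 2))
```

In this example the cold tier holds 100 times more data than the hot tier yet costs only twice as much per month, which is the economic argument for tiering.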
