Elasticsearch Hot/Warm/Cold/Frozen Architecture for Log and Metric Workloads
Elasticsearch implements a four-tier hot/warm/cold/frozen architecture to manage massive time-series data volumes (logs, metrics, security events) while controlling infrastructure costs. Hot nodes run on NVMe SSDs with high CPU and memory, optimized for active indexing and low-latency search. Queries against hot indices typically return in single-digit to tens of milliseconds under load. After data ages past the active window (commonly 24 to 48 hours for logs), Index Lifecycle Management (ILM) rolls over the index and moves it to warm nodes.
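To make the lifecycle concrete, here is a minimal sketch of an ILM policy matching that schedule, expressed as a Python script that PUTs the policy JSON to the cluster's REST API. The endpoint URL, the policy name logs-default, the 50 GB rollover threshold, the phase ages, and the repository name logs-snapshots are illustrative assumptions, not values from this article:

```python
# Minimal ILM policy sketch: roll over daily on hot, shrink and compact on
# warm, then mount as searchable snapshots on cold and frozen.
# Assumes a cluster reachable at ES_URL and a registered snapshot
# repository named "logs-snapshots" (both placeholders).
import requests

ES_URL = "http://localhost:9200"

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over the write index daily or at 50 GB per primary shard
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "warm": {
                "min_age": "1d",
                "actions": {
                    "shrink": {"number_of_shards": 1},     # fewer shards on warm
                    "forcemerge": {"max_num_segments": 1}  # compact segments
                }
            },
            "cold": {
                "min_age": "7d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "logs-snapshots"}
                }
            },
            "frozen": {
                "min_age": "30d",
                "actions": {
                    # Partially mounted: segments fetched from object storage on demand
                    "searchable_snapshot": {"snapshot_repository": "logs-snapshots"}
                }
            },
            "delete": {
                "min_age": "180d",
                "actions": {"delete": {}}
            }
        }
    }
}

resp = requests.put(f"{ES_URL}/_ilm/policy/logs-default", json=policy)
resp.raise_for_status()
```

The searchable_snapshot action in the frozen phase is what converts a local index into a partially mounted snapshot that pulls segments from object storage on demand.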
Warm and cold nodes use cheaper disks (SATA SSDs or HDDs) and lower compute specifications. Indices on warm nodes remain fully indexed and searchable but accept increased query latency (tens to low hundreds of milliseconds) and slower aggregations due to disk throughput limits. Cold nodes cut cost further by shrinking shard counts and compressing segments aggressively. The frozen tier introduces a fundamentally different model: indices are stored as searchable snapshots in object storage (S3, GCS, Azure Blob), and only the required segments are fetched on demand during queries, trading query latency for dramatic cost reduction.
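Searchable snapshots in the cold and frozen phases require a registered snapshot repository. A minimal sketch, assuming an S3 bucket (placeholder name) and a cluster whose keystore already holds the S3 credentials:

```python
# Register an S3 snapshot repository that the cold and frozen phases can
# mount searchable snapshots from. Bucket, base_path, and ES_URL are
# placeholders.
import requests

ES_URL = "http://localhost:9200"

repo = {
    "type": "s3",
    "settings": {
        "bucket": "my-es-snapshots",  # placeholder bucket name
        "base_path": "logs"
    }
}

resp = requests.put(f"{ES_URL}/_snapshot/logs-snapshots", json=repo)
resp.raise_for_status()
```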
Real-world deployments ingesting 1 terabyte per day of logs often retain 24 hours hot, 2 to 7 days warm, 30 days cold, and several months frozen. This pattern cuts storage spend by 40 to 90 percent compared to keeping everything hot, while meeting investigative Service Level Agreements (SLAs) where analysts tolerate higher latency for historical data. The frozen tier achieves up to roughly 90 percent lower storage cost than hot because object storage is cheaper per gigabyte-month and the cluster does not need to provision local disk for infrequently accessed data.
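A back-of-envelope model shows where the 40 to 90 percent figure comes from. The per-gigabyte-month prices and replication factors below are assumptions for illustration, not vendor quotes:

```python
# Back-of-envelope storage footprint for a 1 TB/day pipeline with the
# retention schedule above. Hot/warm carry a 2x replica factor; cold and
# frozen are modeled as a single snapshot-backed copy. Prices assumed.
DAILY_GB = 1000

tiers = {
    #          (days retained, copies, $ per GB-month, all assumed)
    "hot":    (1,   2, 0.10),
    "warm":   (6,   2, 0.05),
    "cold":   (23,  1, 0.03),
    "frozen": (150, 1, 0.02),
}

total = 0.0
for name, (days, copies, price) in tiers.items():
    gb = DAILY_GB * days * copies
    cost = gb * price
    total += cost
    print(f"{name:>6}: {gb:>8,} GB -> ${cost:,.0f}/month")

# Compare against keeping the full 180 days hot with replicas
all_hot = DAILY_GB * 180 * 2 * 0.10
print(f"tiered total: ${total:,.0f}/month vs all-hot: ${all_hot:,.0f}/month "
      f"({1 - total / all_hot:.0%} saved)")
```

Under these assumed prices the tiered layout lands near the top of the quoted range (about 88 percent saved); cheaper frozen storage or longer retention pushes the savings higher.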
The tradeoff is tail latency. Queries spanning hot plus frozen shards can time out if they are not tier-aware. Aggregations that fan out across months of frozen data fetch many segments from object storage, causing 99th-percentile (P99) latency spikes. Production systems mitigate this by partitioning queries by time range and applying separate timeouts per tier (as in the sketch below), or by pre-aggregating metrics in warm before freezing. Frozen also introduces cache pressure: the cluster caches fetched segments locally, so repeated access to the same frozen data becomes faster, but cache misses trigger segment downloads that can saturate network or object store throughput during large historical recalls.
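One way to implement the tier-aware pattern is to split a logical search into two requests: a recent window against hot/warm indices with a tight timeout, and a historical window against frozen indices with a generous one. The index patterns (including the partial- prefix that partially mounted frozen indices receive by default), the field names, and the timeout values are illustrative assumptions:

```python
# Tier-aware query partitioning sketch: fail fast on hot/warm, allow
# object-storage segment fetches on frozen, instead of letting one
# fan-out query stall on the slowest tier.
import requests

ES_URL = "http://localhost:9200"

def search_window(index_pattern, gte, lt, timeout):
    body = {
        "timeout": timeout,  # per-request search timeout
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": gte, "lt": lt}}},
                    {"term": {"event.action": "denied"}},  # placeholder filter
                ]
            }
        },
    }
    return requests.post(f"{ES_URL}/{index_pattern}/_search", json=body).json()

# Recent data on hot/warm nodes: expect milliseconds, fail fast.
recent = search_window("logs-*", "now-7d", "now", timeout="2s")

# Historical data on the frozen tier: tolerate segment downloads.
historical = search_window("partial-logs-*", "now-90d", "now-7d", timeout="60s")
```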
💡 Key Takeaways
• Hot nodes on NVMe SSDs deliver single-digit to tens-of-milliseconds query latency for active indexing and search, typically retaining 24 to 48 hours of data for real-time dashboards and alerting
• Warm and cold nodes use cheaper SATA SSDs or HDDs, accepting query latencies in the tens to low hundreds of milliseconds; the cold tier shrinks shard counts and compresses aggressively to reduce disk usage
• The frozen tier stores indices as searchable snapshots in object storage (S3, GCS, Azure Blob), fetching segments on demand; it achieves up to roughly 90 percent storage cost reduction versus hot but introduces P99 latency spikes
• Production pipelines ingesting 1 terabyte per day often retain 24 hours hot, 2 to 7 days warm, 30 days cold, and several months frozen, cutting storage spend by 40 to 90 percent while meeting investigative SLAs
• Queries spanning hot and frozen shards risk timeouts if they are not tier-aware; fan-out aggregations across months of frozen data cause segment downloads that can saturate network or object store throughput
• Index Lifecycle Management (ILM) automates rollover, shrink, and snapshot transitions; teams must configure separate per-tier timeouts and pre-aggregate metrics in warm to prevent frozen queries from degrading overall cluster performance
📌 Examples
A security operations center ingests 500 gigabytes per day of firewall logs, keeping 48 hours hot for real-time threat hunting, 7 days warm for incident investigation, and 90 days frozen for compliance, reducing storage cost by 70 percent versus an all-hot deployment
An observability platform retains 24 hours of metrics on hot nodes for live dashboards, moves them to warm after 1 day, and freezes them after 30 days; frozen queries during postmortems fetch segments on demand with P99 latencies reaching 2 to 5 seconds, versus 50 milliseconds on hot
A compliance team runs quarterly audits querying 6 months of frozen audit logs; the first query per index incurs high latency (10 to 30 seconds) while segments download, but subsequent queries hit the local cache and complete in under 1 second
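The audit example suggests a pragmatic mitigation: replay a representative query against each frozen index before the audit session starts, so segment downloads happen up front and the analysts' interactive queries mostly hit the local cache. This is a hypothetical sketch; how much of the cache a warm-up pass actually populates depends on whether it touches the same fields and segments as the real workload:

```python
# Hypothetical pre-warm pass: run a representative aggregation over each
# frozen month so segments are pulled into the shared cache before the
# audit begins. Index names and the aggregation field are placeholders.
import requests

ES_URL = "http://localhost:9200"
MONTHS = ["2024.01", "2024.02", "2024.03"]  # placeholder index suffixes

for month in MONTHS:
    body = {
        "size": 0,  # no hits needed; we only want the segment fetches
        "query": {"range": {"@timestamp": {"gte": "now-180d"}}},
        "aggs": {"by_user": {"terms": {"field": "user.name", "size": 100}}},
        "timeout": "120s",  # tolerate slow first-touch latency
    }
    requests.post(f"{ES_URL}/partial-audit-{month}/_search", json=body)
```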