Elasticsearch Hot/Warm/Cold/Frozen Architecture for Log and Metric Workloads
Elasticsearch implements a four-tier hot/warm/cold/frozen architecture to manage massive time-series data volumes (logs, metrics, security events) while controlling infrastructure costs. Hot nodes run on NVMe SSDs with high CPU and memory, optimized for active indexing and low-latency search. Queries against hot indices typically return in single-digit to tens of milliseconds under load. After data ages past the active window (commonly 24 to 48 hours for logs), Index Lifecycle Management (ILM) rolls over the index and moves it to warm nodes.
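To make the lifecycle concrete, here is a minimal sketch of an ILM policy matching that schedule, expressed as a Python script that PUTs the policy JSON to the cluster's REST API. The endpoint URL, the policy name logs-default, the 50 GB rollover threshold, the phase ages, and the repository name logs-snapshots are illustrative assumptions, not values from this article:

```python
# Minimal ILM policy sketch: roll over daily on hot, shrink and compact on
# warm, then mount as searchable snapshots on cold and frozen.
# Assumes a cluster reachable at ES_URL and a registered snapshot
# repository named "logs-snapshots" (both placeholders).
import requests

ES_URL = "http://localhost:9200"

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over the write index daily or at 50 GB per primary shard
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "warm": {
                "min_age": "1d",
                "actions": {
                    "shrink": {"number_of_shards": 1},     # fewer shards on warm
                    "forcemerge": {"max_num_segments": 1}  # compact segments
                }
            },
            "cold": {
                "min_age": "7d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "logs-snapshots"}
                }
            },
            "frozen": {
                "min_age": "30d",
                "actions": {
                    # Partially mounted: segments fetched from object storage on demand
                    "searchable_snapshot": {"snapshot_repository": "logs-snapshots"}
                }
            },
            "delete": {
                "min_age": "180d",
                "actions": {"delete": {}}
            }
        }
    }
}

resp = requests.put(f"{ES_URL}/_ilm/policy/logs-default", json=policy)
resp.raise_for_status()
```

The searchable_snapshot action in the frozen phase is what converts a local index into a partially mounted snapshot that pulls segments from object storage on demand.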
Warm and cold nodes use cheaper disks (SATA SSDs or HDDs) and lower compute specifications. Indices on warm nodes remain fully indexed and searchable but accept increased query latency (tens to low hundreds of milliseconds) and slower aggregations due to disk throughput limits. Cold nodes cut cost further by shrinking shard counts and compressing segments aggressively. The frozen tier introduces a fundamentally different model: indices are stored as searchable snapshots in object storage (S3, GCS, Azure Blob), and only the required segments are fetched on demand during queries, trading query latency for dramatic cost reduction.
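Searchable snapshots in the cold and frozen phases require a registered snapshot repository. A minimal sketch, assuming an S3 bucket (placeholder name) and a cluster whose keystore already holds the S3 credentials:

```python
# Register an S3 snapshot repository that the cold and frozen phases can
# mount searchable snapshots from. Bucket, base_path, and ES_URL are
# placeholders.
import requests

ES_URL = "http://localhost:9200"

repo = {
    "type": "s3",
    "settings": {
        "bucket": "my-es-snapshots",  # placeholder bucket name
        "base_path": "logs"
    }
}

resp = requests.put(f"{ES_URL}/_snapshot/logs-snapshots", json=repo)
resp.raise_for_status()
```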
Real-world deployments ingesting 1 terabyte per day of logs often retain 24 hours hot, 2 to 7 days warm, 30 days cold, and several months frozen. This pattern cuts storage spend by 40 to 90 percent compared to keeping everything hot, while meeting investigative Service Level Agreements (SLAs) where analysts tolerate higher latency for historical data. The frozen tier achieves up to roughly 90 percent lower storage cost than hot because object storage is cheaper per gigabyte-month and the cluster does not need to provision local disk for infrequently accessed data.
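A back-of-envelope model shows where the 40 to 90 percent figure comes from. The per-gigabyte-month prices and replication factors below are assumptions for illustration, not vendor quotes:

```python
# Back-of-envelope storage footprint for a 1 TB/day pipeline with the
# retention schedule above. Hot/warm carry a 2x replica factor; cold and
# frozen are modeled as a single snapshot-backed copy. Prices assumed.
DAILY_GB = 1000

tiers = {
    #          (days retained, copies, $ per GB-month, all assumed)
    "hot":    (1,   2, 0.10),
    "warm":   (6,   2, 0.05),
    "cold":   (23,  1, 0.03),
    "frozen": (150, 1, 0.02),
}

total = 0.0
for name, (days, copies, price) in tiers.items():
    gb = DAILY_GB * days * copies
    cost = gb * price
    total += cost
    print(f"{name:>6}: {gb:>8,} GB -> ${cost:,.0f}/month")

# Compare against keeping the full 180 days hot with replicas
all_hot = DAILY_GB * 180 * 2 * 0.10
print(f"tiered total: ${total:,.0f}/month vs all-hot: ${all_hot:,.0f}/month "
      f"({1 - total / all_hot:.0%} saved)")
```

Under these assumed prices the tiered layout lands near the top of the quoted range (about 88 percent saved); cheaper frozen storage or longer retention pushes the savings higher.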
The tradeoff is tail latency. Queries spanning hot plus frozen shards can time out if they are not tier-aware. Aggregations that fan out across months of frozen data fetch many segments from object storage, causing 99th-percentile (P99) latency spikes. Production systems mitigate this by partitioning queries by time range and applying separate timeouts per tier (as in the sketch below), or by pre-aggregating metrics in warm before freezing. Frozen also introduces cache pressure: the cluster caches fetched segments locally, so repeated access to the same frozen data becomes faster, but cache misses trigger segment downloads that can saturate network or object store throughput during large historical recalls.
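One way to implement the tier-aware pattern is to split a logical search into two requests: a recent window against hot/warm indices with a tight timeout, and a historical window against frozen indices with a generous one. The index patterns (including the partial- prefix that partially mounted frozen indices receive by default), the field names, and the timeout values are illustrative assumptions:

```python
# Tier-aware query partitioning sketch: fail fast on hot/warm, allow
# object-storage segment fetches on frozen, instead of letting one
# fan-out query stall on the slowest tier.
import requests

ES_URL = "http://localhost:9200"

def search_window(index_pattern, gte, lt, timeout):
    body = {
        "timeout": timeout,  # per-request search timeout
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": gte, "lt": lt}}},
                    {"term": {"event.action": "denied"}},  # placeholder filter
                ]
            }
        },
    }
    return requests.post(f"{ES_URL}/{index_pattern}/_search", json=body).json()

# Recent data on hot/warm nodes: expect milliseconds, fail fast.
recent = search_window("logs-*", "now-7d", "now", timeout="2s")

# Historical data on the frozen tier: tolerate segment downloads.
historical = search_window("partial-logs-*", "now-90d", "now-7d", timeout="60s")
```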
💡 Key Takeaways
• Hot nodes on NVMe SSDs deliver single-digit to tens-of-milliseconds query latency for active indexing and search, typically retaining 24 to 48 hours of data for real-time dashboards and alerting
• Warm and cold nodes use cheaper SATA SSDs or HDDs, accepting query latencies in the tens to low hundreds of milliseconds; the cold tier shrinks shard counts and compresses aggressively to reduce disk usage
• The frozen tier stores indices as searchable snapshots in object storage (S3, GCS, Azure Blob), fetching segments on demand; it achieves up to roughly 90 percent storage cost reduction versus hot but introduces P99 latency spikes
• Production pipelines ingesting 1 terabyte per day often retain 24 hours hot, 2 to 7 days warm, 30 days cold, and several months frozen, cutting storage spend by 40 to 90 percent while meeting investigative SLAs
• Queries spanning hot and frozen shards risk timeouts if they are not tier-aware; fan-out aggregations across months of frozen data cause segment downloads that can saturate network or object store throughput
• Index Lifecycle Management (ILM) automates rollover, shrink, and snapshot transitions; teams must configure separate per-tier timeouts and pre-aggregate metrics in warm to prevent frozen queries from degrading overall cluster performance
📌 Examples
A security operations center ingests 500 gigabytes per day of firewall logs, keeping 48 hours hot for real-time threat hunting, 7 days warm for incident investigation, and 90 days frozen for compliance, reducing storage cost by 70 percent versus an all-hot deployment
An observability platform retains 24 hours of metrics on hot nodes for live dashboards, moves them to warm after 1 day, and freezes them after 30 days; frozen queries during postmortems fetch segments on demand with P99 latencies reaching 2 to 5 seconds, versus 50 milliseconds on hot
A compliance team runs quarterly audits querying 6 months of frozen audit logs; the first query per index incurs high latency (10 to 30 seconds) while segments download, but subsequent queries hit the local cache and complete in under 1 second
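The audit example suggests a pragmatic mitigation: replay a representative query against each frozen index before the audit session starts, so segment downloads happen up front and the analysts' interactive queries mostly hit the local cache. This is a hypothetical sketch; how much of the cache a warm-up pass actually populates depends on whether it touches the same fields and segments as the real workload:

```python
# Hypothetical pre-warm pass: run a representative aggregation over each
# frozen month so segments are pulled into the shared cache before the
# audit begins. Index names and the aggregation field are placeholders.
import requests

ES_URL = "http://localhost:9200"
MONTHS = ["2024.01", "2024.02", "2024.03"]  # placeholder index suffixes

for month in MONTHS:
    body = {
        "size": 0,  # no hits needed; we only want the segment fetches
        "query": {"range": {"@timestamp": {"gte": "now-180d"}}},
        "aggs": {"by_user": {"terms": {"field": "user.name", "size": 100}}},
        "timeout": "120s",  # tolerate slow first-touch latency
    }
    requests.post(f"{ES_URL}/partial-audit-{month}/_search", json=body)
```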