Advanced Pattern: Separation of Storage and Compute
The Architectural Breakthrough:
Traditional data warehouses bundled storage and compute in a single appliance. You bought a 100 TB system with 50 compute nodes. If you needed more storage, you upgraded the whole appliance. If you needed more compute for year end reporting, you were stuck. This coupling killed elasticity and forced massive upfront capital expenditure.
Modern cloud warehouses separate storage from compute. Data lives in cheap, durable object storage (Amazon S3, Google Cloud Storage, Azure Blob). Compute clusters spin up on demand, read from storage, process queries, and write results back. Compute and storage scale independently.
How It Works in Practice:
Snowflake pioneered this architecture. Data is stored in micro partitions (typically 16 MB compressed) in S3, organized by table and partition keys. When you run a query, Snowflake spins up a virtual warehouse (a cluster of compute nodes), pulls relevant micro partitions from S3 into local SSD cache, executes the query, and streams results back. If the warehouse is idle for a few minutes, it suspends automatically, stopping compute charges while data remains in S3.
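As a rough sketch of that lifecycle in Snowflake SQL (the warehouse and table names here are invented for illustration), the warehouse below starts on demand, serves repeat reads from its local SSD cache, and suspends itself after five minutes of inactivity:

-- Create a small on demand compute cluster; the data itself stays in S3.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
       AUTO_SUSPEND = 300   -- suspend after 300 seconds idle, stopping compute charges
       AUTO_RESUME = TRUE;  -- wake up automatically when the next query arrives

USE WAREHOUSE analytics_wh;

-- The warehouse pulls only the relevant micro partitions from S3 into local SSD cache.
SELECT COUNT(*) FROM sales WHERE date = '2024-01-15';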
The cache layer is critical. Reading from S3 has 10 to 50ms latency per request. For a query scanning 1,000 micro partitions, that is 10 to 50 seconds just in network time. The local SSD cache on compute nodes reduces this to under 1ms for cached data. Snowflake's query optimizer also prunes partitions aggressively: if you query <code>WHERE date = '2024-01-15'</code>, it skips 99% of partitions, reading only the relevant 10 GB instead of 1 TB.
Cost Impact: storage runs about $0.02/GB per month, while compute runs roughly $2-10 per hour.
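The pruning behavior described above, sketched in Snowflake SQL (table and column names are illustrative; whether pruning actually kicks in depends on the engine and how the data is clustered):

-- Pruning friendly: a plain predicate on the date column lets the optimizer skip
-- every micro partition whose min/max metadata excludes 2024-01-15.
SELECT SUM(amount)
FROM sales
WHERE date = '2024-01-15';

-- Pruning hostile: wrapping the column in a function hides the predicate from the
-- partition metadata, so far more partitions may be scanned.
SELECT SUM(amount)
FROM sales
WHERE TO_CHAR(date, 'YYYY-MM-DD') = '2024-01-15';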
Elasticity and Cost Model:
The power is elasticity. You can run 5 small warehouses for different teams, each auto scaling independently. The data science team runs a 2 node warehouse for exploration, costing $4 per hour when active. During quarter end, finance spins up a 64 node warehouse for 2 hours to process a massive report, costing $160. Both query the same data in S3 without copying or moving it.
This changes the cost model from fixed CapEx to variable OpEx. A traditional warehouse might cost $500k upfront for hardware plus $50k annual maintenance. A cloud warehouse costs zero upfront, $2,000 per month for 100 TB storage, and $5,000 to $30,000 per month for compute depending on usage. Low usage teams save massively. High usage teams pay more but get infinite scale.
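A hedged sketch of the multi team setup described above, in Snowflake SQL (warehouse, table, and column names are invented for illustration; sizes and hourly rates vary by vendor and edition):

-- Two independently sized warehouses; both read the same tables in S3,
-- so nothing is copied or moved between teams.
CREATE WAREHOUSE ds_wh WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE finance_wh WITH WAREHOUSE_SIZE = '4X-LARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

-- Day to day exploration runs on the small, cheap cluster.
USE WAREHOUSE ds_wh;
SELECT * FROM orders SAMPLE (1);

-- The quarter end crunch runs on the large cluster, which suspends again when idle.
USE WAREHOUSE finance_wh;
SELECT region, SUM(revenue) AS total_revenue FROM orders GROUP BY region;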
⚠️ Common Pitfall: Teams spin up large warehouses and forget to set auto suspend. A 32 node warehouse left running 24/7 costs $60,000 per month. Always set auto suspend to 5 to 10 minutes of inactivity and use the smallest warehouse that meets SLAs.
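One way to audit and fix this in Snowflake (the warehouse name is a placeholder):

-- List warehouses and inspect the auto_suspend column for NULL or very large values.
SHOW WAREHOUSES;

-- Enforce a 5 minute idle timeout and suspend the runaway cluster immediately.
ALTER WAREHOUSE reporting_wh SET AUTO_SUSPEND = 300;
ALTER WAREHOUSE reporting_wh SUSPEND;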
Concurrency and Isolation:
Separating storage from compute enables true workload isolation. You can create one warehouse for production dashboards and another for ad hoc queries. If someone runs a 10 minute query in the ad hoc warehouse, it does not steal resources from production dashboards. They are physically separate compute clusters reading the same data.
This also enables fine grained cost attribution. Each warehouse has usage metrics: queries run, compute hours, data scanned. You can charge back data science teams for their warehouse usage while subsidizing critical business dashboards. This visibility drives optimization: teams see their $10,000 monthly bill and start writing better queries.
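For example, in Snowflake the per warehouse credit consumption is queryable; the sketch below assumes access to the ACCOUNT_USAGE share and uses the documented metering history view, so verify names against your own account:

-- Credits consumed per warehouse over the last 30 days, e.g. for chargeback reports.
SELECT warehouse_name,
       SUM(credits_used) AS credits_30d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_30d DESC;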
The Trade-off: Network Becomes the Bottleneck:
Separating storage from compute means every query pulls data over the network from object storage. For small queries scanning a few GB, this adds 100 to 500ms latency versus reading from local disk. For very large queries scanning 100 TB, network bandwidth (typically 10 to 25 Gbps per node) can become the bottleneck.
To mitigate this, systems use aggressive caching, partition pruning, and predicate pushdown. Snowflake maintains a distributed cache across compute nodes: if your query scans the same partitions as a recent query, it hits cache and avoids S3 entirely. BigQuery uses a columnar format (Capacitor) and only reads the columns you need, reducing network transfer by 10x to 50x versus reading full rows.
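The column pruning effect shows up in the queries themselves (table and column names are illustrative; the same principle applies to Snowflake's columnar micro partitions and BigQuery's Capacitor files):

-- Touches every column of every matching row: maximum bytes pulled over the network.
SELECT * FROM events WHERE event_date = '2024-01-15';

-- Touches only two columns, so network transfer shrinks roughly in proportion
-- to the share of columns actually read.
SELECT user_id, SUM(revenue) AS revenue
FROM events
WHERE event_date = '2024-01-15'
GROUP BY user_id;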
Durability and Failure Handling:
Object storage like S3 provides eleven nines durability. Your data is replicated across multiple availability zones and can survive data center failures. If a compute node fails mid query, the coordinator detects it within seconds and reschedules affected tasks on other nodes. The query might slow down by 10% to 20% but completes successfully.
For multi region disaster recovery, some teams replicate data to a secondary region. Snowflake supports cross region replication with automated failover. If the primary region goes down, you can switch to the replica region within minutes. This adds cost (2x storage, cross region bandwidth) but ensures business continuity for critical workloads.
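A rough sketch of what that setup can look like with Snowflake's database replication commands (account and database names are placeholders, and the exact syntax and required edition vary, so treat this as illustrative rather than a runbook):

-- On the primary account: allow the database to be replicated to a secondary account.
ALTER DATABASE analytics ENABLE REPLICATION TO ACCOUNTS myorg.dr_account;

-- On the secondary account: create the replica and refresh it on a schedule.
CREATE DATABASE analytics AS REPLICA OF myorg.primary_account.analytics;
ALTER DATABASE analytics REFRESH;

💡 Key Takeaways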
✓ Separating storage and compute allows independent scaling. Store 100 TB in S3 for $2,000/month while running small 2 node warehouses for $4/hour or massive 64 node warehouses for $80/hour as needed.
✓ Local SSD caches on compute nodes reduce latency from 10 to 50ms per S3 request to under 1ms for cached data, making separation viable for interactive queries.
✓ Workload isolation via separate warehouses prevents ad hoc queries from impacting production dashboards. Each warehouse has its own compute and cost attribution.
✓ Network becomes the bottleneck for queries scanning 100+ TB. Mitigate with aggressive partition pruning (skip 99% of data), columnar formats (read only needed columns), and distributed caching.
📌 Examples
1. Snowflake stores data as 16 MB micro partitions in S3. A query with <code>WHERE date = '2024-01-15'</code> on a 1 TB table reads only 10 GB of relevant partitions, scanning 1% of data and completing in 2 seconds instead of 3 minutes.
2. A finance team spins up a 64 node warehouse for 2 hours during quarter end to process massive reports, costing $160 total. The same warehouse auto suspends after 10 minutes idle, avoiding $60,000/month in wasted spend.
3. BigQuery uses columnar Capacitor format. Scanning <code>user_id</code> and <code>revenue</code> from a 50 column table transfers 4% of data over the network versus row based formats that transfer 100%, reducing query time from 30 seconds to 1.2 seconds.