Database DesignColumn-Oriented Databases (Redshift, BigQuery)Medium⏱️ ~3 min

Distributed Execution Models: Massively Parallel Processing (MPP) Clusters vs Serverless Pooled Compute

MPP Cluster Architecture

MPP (Massively Parallel Processing) clusters distribute query execution across dedicated compute nodes coordinated by a leader. You provision a fixed cluster (say 10 nodes), control data distribution (how rows spread across nodes via a distribution key), and define sort keys (physical ordering within each node). This provides deterministic performance: a well-tuned cluster delivers consistent 5-second response times for dashboard queries because you control the hardware.

The cost: you pay for uptime regardless of utilization, manually scale when workloads grow, and waste spend during idle periods. A 10-node cluster at ,000/month runs continuously even if 60% of capacity sits idle overnight. Economics favor steady, predictable workloads with high utilization.

Serverless Pooled Compute

Serverless models decouple storage from compute. Data lives in distributed object storage, and queries dynamically schedule parallel readers across a shared pool of slots (execution units). You pay per bytes scanned (~/TB) plus storage (~/TB/month). Query scanning 10 TB costs whether it runs at 2am or peak hours.

This model excels at spiky, unpredictable workloads and eliminates scaling decisions. However, performance varies under multi-tenancy (shared resources with other users), and poor partition pruning explodes costs: a mistaken full table scan of 100 TB costs in one query.

Join Execution Differences

MPP clusters use distribution keys to colocate join keys on the same nodes, avoiding expensive shuffles (moving data between nodes). If fact table and dimension table share distribution key user_id, joins are local. Serverless systems broadcast small tables (< 100-300MB) to all workers or shuffle both sides on join key, causing network I/O.

Key Trade-off: MPP gives control and predictable performance at higher fixed cost. Serverless gives elasticity and pay-per-query at variable performance. Choose based on workload predictability and budget model.
💡 Key Takeaways
MPP clusters allocate dedicated nodes with fixed capacity; you control distribution keys and sort keys for predictable 5-10 second query times
MPP cost is fixed regardless of utilization: ,000/month cluster costs the same whether running 24/7 or idle 60% of the time
Serverless charges per bytes scanned (~/TB) plus storage (~/TB/month); 10TB query costs regardless of time of day
Serverless performance varies under multi-tenancy (shared pool with other users); poor pruning causes cost explosions ( for 100TB scan)
MPP uses distribution keys for colocated joins (no shuffle); serverless broadcasts small tables (<300MB) or shuffles both sides
Choose MPP for steady predictable workloads with high utilization; choose serverless for spiky unpredictable usage patterns
📌 Interview Tips
1Calculate MPP vs serverless: 10TB/day scanning at /TB = /day = ,500/month serverless. MPP cluster at ,000/month only makes sense above 33TB/day sustained.
2Explain distribution key benefit: fact and dimension both distributed on user_id means joins happen locally on each node. No shuffle = 10x faster than redistributing.
3Describe multi-tenancy impact: serverless query gets 1,000 slots in quiet period, finishes in 10s. Same query during peak gets 200 slots (others using pool), finishes in 50s.
← Back to Column-Oriented Databases (Redshift, BigQuery) Overview
Distributed Execution Models: Massively Parallel Processing (MPP) Clusters vs Serverless Pooled Compute | Column-Oriented Databases (Redshift, BigQuery) - System Overflow