Distributed Execution Models: Massively Parallel Processing (MPP) Clusters vs Serverless Pooled Compute

MPP Cluster Architecture
MPP (Massively Parallel Processing) clusters distribute query execution across dedicated compute nodes coordinated by a leader. You provision a fixed cluster (say 10 nodes), control data distribution (how rows spread across nodes via a distribution key), and define sort keys (physical ordering within each node). This provides deterministic performance: a well-tuned cluster delivers consistent 5-second response times for dashboard queries because you control the hardware.
The cost: you pay for uptime regardless of utilization, manually scale when workloads grow, and waste spend during idle periods. A 10-node cluster at ,000/month runs continuously even if 60% of capacity sits idle overnight. Economics favor steady, predictable workloads with high utilization.
Serverless Pooled Compute
Serverless models decouple storage from compute. Data lives in distributed object storage, and queries dynamically schedule parallel readers across a shared pool of slots (execution units). You pay per bytes scanned (~/TB) plus storage (~/TB/month). Query scanning 10 TB costs  whether it runs at 2am or peak hours.
This model excels at spiky, unpredictable workloads and eliminates scaling decisions. However, performance varies under multi-tenancy (shared resources with other users), and poor partition pruning explodes costs: a mistaken full table scan of 100 TB costs  in one query.
Join Execution Differences
MPP clusters use distribution keys to colocate join keys on the same nodes, avoiding expensive shuffles (moving data between nodes). If fact table and dimension table share distribution key user_id, joins are local. Serverless systems broadcast small tables (< 100-300MB) to all workers or shuffle both sides on join key, causing network I/O.
Key Trade-off: MPP gives control and predictable performance at higher fixed cost. Serverless gives elasticity and pay-per-query at variable performance. Choose based on workload predictability and budget model.

💡 Key Takeaways

✓MPP clusters allocate dedicated nodes with fixed capacity; you control distribution keys and sort keys for predictable 5-10 second query times

✓MPP cost is fixed regardless of utilization: ,000/month cluster costs the same whether running 24/7 or idle 60% of the time

✓Serverless charges per bytes scanned (~/TB) plus storage (~/TB/month); 10TB query costs regardless of time of day

✓Serverless performance varies under multi-tenancy (shared pool with other users); poor pruning causes cost explosions ( for 100TB scan)

✓MPP uses distribution keys for colocated joins (no shuffle); serverless broadcasts small tables (<300MB) or shuffles both sides

✓Choose MPP for steady predictable workloads with high utilization; choose serverless for spiky unpredictable usage patterns

📌 Interview Tips

1Calculate MPP vs serverless: 10TB/day scanning at /TB = /day = ,500/month serverless. MPP cluster at ,000/month only makes sense above 33TB/day sustained.

2Explain distribution key benefit: fact and dimension both distributed on user_id means joins happen locally on each node. No shuffle = 10x faster than redistributing.

3Describe multi-tenancy impact: serverless query gets 1,000 slots in quiet period, finishes in 10s. Same query during peak gets 200 slots (others using pool), finishes in 50s.

← Back to Column-Oriented Databases (Redshift, BigQuery) Overview