
Distributed Execution Models: Massively Parallel Processing (MPP) Clusters vs Serverless Pooled Compute

Column-oriented warehouses distribute query execution across many workers, but the orchestration model fundamentally shapes performance and cost. Massively Parallel Processing (MPP) clusters like Amazon Redshift allocate dedicated compute nodes coordinated by a leader node. You provision a fixed cluster (say 10 nodes), pay for uptime regardless of utilization, and get deterministic performance within that capacity. Serverless models like Google BigQuery allocate ephemeral slots from a shared pool per query, scale automatically, and under on-demand pricing charge for bytes scanned rather than for uptime.

MPP clusters traditionally bundle storage with compute on local disks. Performance is predictable because you control distribution strategies (how rows spread across nodes) and sort keys (physical ordering on disk). A well-tuned cluster with data distributed evenly on user_id and sorted by timestamp delivers consistent 5 second response times for dashboard queries. The cost is operational overhead, manual scaling friction when workloads grow, and wasted spend during idle periods. Cluster economics favor steady, predictable workloads where high utilization justifies an equivalent cost of roughly $300+ per TB per month.

Serverless decouples storage from compute. Data lives in distributed object storage (like Google Colossus), and queries dynamically schedule parallel readers across thousands of slots. You pay roughly $20 per TB per month for storage and $5 per TB scanned for compute. A query reading 10 TB costs $50 whether it runs at 2am or during peak hours. This model excels at spiky, unpredictable workloads and eliminates scaling decisions, but performance varies under multi-tenancy, and poor partition pruning makes costs explode. A mistaken full table scan of 100 TB costs $500 in a single query.

Join execution differs too. MPP clusters use distribution keys to colocate rows with matching join keys on the same node, avoiding shuffles for well-designed schemas. Serverless systems either broadcast the small table (under a few hundred MB) to all workers or shuffle both sides on the join key. Poor distribution in MPP creates hot nodes and stragglers; poor broadcast decisions in serverless create massive network shuffles that spill to disk and slow queries from seconds to minutes.
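To make the hot-node problem concrete, here is a minimal sketch that hashes distribution keys onto nodes and measures how far the busiest node sits above an even share. The hash function (SHA-256), the helper names, and the sample keys are all illustrative assumptions; real warehouses use their own internal hashing and placement logic.

```python
import hashlib
from collections import Counter


def node_for_key(key: str, num_nodes: int) -> int:
    """Map a distribution key to a node index.

    SHA-256 is only a stand-in; a real warehouse uses its own internal hash.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def skew_ratio(keys: list, num_nodes: int) -> float:
    """Rows on the busiest node divided by the ideal even share (1.0 is perfectly even)."""
    counts = Counter(node_for_key(k, num_nodes) for k in keys)
    ideal = len(keys) / num_nodes
    return max(counts.values()) / ideal


NUM_NODES = 10

# High-cardinality key such as user_id: rows spread close to evenly.
even_keys = [f"user_{i}" for i in range(100_000)]

# Hypothetical low-cardinality key where a single value owns 90% of the rows.
skewed_keys = ["US"] * 90_000 + [f"country_{i}" for i in range(10_000)]

even_skew = skew_ratio(even_keys, NUM_NODES)
hot_skew = skew_ratio(skewed_keys, NUM_NODES)

print(f"user_id skew: {even_skew:.2f}")
print(f"country skew: {hot_skew:.2f}")
```

In the skewed case one node holds at least 90,000 of the 100,000 rows, so the ratio is 9 or higher. Since a parallel query finishes only when its slowest node does, that node becomes the straggler.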
💡 Key Takeaways
MPP clusters provide predictable performance with dedicated resources. A 10-node Redshift cluster might cost $5000 per month, delivering consistent query times but wasting budget during idle periods.
Serverless on-demand pricing charges per byte scanned, shifting optimization effort to partition and cluster pruning. A query scanning 10 TB costs $50 regardless of time of day, but poor filter pushdown can turn it into a 100 TB scan costing $500.
Join strategy depends on the model. MPP colocates data via distribution keys to avoid shuffles. Serverless broadcasts tables under a few hundred MB or shuffles both sides, so keeping dimension tables small is critical for performance.
Concurrency limits differ. MPP clusters have a fixed number of query slots and queue queries under load. Serverless throttles via slot quotas but masks individual node failures by rescheduling tasks across the pool.
Storage costs diverge dramatically. MPP bundles storage with compute at an equivalent of roughly $300+ per TB per month. Serverless bills storage separately at roughly $20 per TB per month, making queries over cold archive data economical.
The operational tradeoff is control versus abstraction. MPP requires tuning distribution keys, sort keys, and vacuum schedules. Serverless removes most of that operational work but makes performance less predictable under multi-tenancy.
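The join takeaway can be sketched as a small decision function. This is a simplified model rather than either engine's actual planner: the 200 MB threshold stands in for "a few hundred MB", and the Table class, the strategy names, the byte estimates, and the sample tables are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024**2
GB = 1024**3

# Assumption: 200 MB stands in for the "few hundred MB" broadcast limit.
BROADCAST_LIMIT_BYTES = 200 * MB


@dataclass
class Table:
    name: str
    size_bytes: int
    dist_key: Optional[str] = None  # None models storage with no fixed row placement


def choose_join_strategy(left: Table, right: Table, join_key: str) -> str:
    """Pick a join strategy from table placement and size."""
    # Both sides already partitioned on the join key: matching rows share a node.
    if left.dist_key == join_key and right.dist_key == join_key:
        return "colocated"
    # One side is small enough to copy to every worker.
    if min(left.size_bytes, right.size_bytes) <= BROADCAST_LIMIT_BYTES:
        return "broadcast"
    # Otherwise both sides are repartitioned on the join key.
    return "shuffle"


def estimated_bytes_moved(strategy: str, left: Table, right: Table, num_workers: int) -> int:
    """Rough network cost: a deliberately crude estimate for comparison only."""
    if strategy == "colocated":
        return 0
    if strategy == "broadcast":
        return min(left.size_bytes, right.size_bytes) * num_workers
    return left.size_bytes + right.size_bytes


events = Table("events", 2000 * GB, dist_key="user_id")
users = Table("users", 50 * GB, dist_key="user_id")
countries = Table("countries", 5 * MB)
sessions = Table("sessions", 800 * GB)

for other, key in [(users, "user_id"), (countries, "country_code"), (sessions, "session_id")]:
    strategy = choose_join_strategy(events, other, key)
    moved = estimated_bytes_moved(strategy, events, other, num_workers=100)
    print(f"events JOIN {other.name} ON {key}: {strategy}, ~{moved / GB:.1f} GB moved")
```

Under these assumptions, broadcasting the 5 MB table to 100 workers moves about 0.5 GB, while shuffling both large tables moves 2800 GB. That gap is why small dimension tables matter so much.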
📌 Examples
Amazon Redshift deployment for a steady BI workload. Provision a 10-node cluster at $5000 per month. Distribute the fact table on user_id and sort it by timestamp. Dashboard queries consistently return in 5 seconds. During off hours, 60% of capacity sits idle but the cost remains fixed.
Google BigQuery for a spiky data science workload. Store 50 TB at $1000 per month. Ad hoc queries scan 1 to 10 TB, costing $5 to $50 each. A weekly 50 TB full scan for model training costs $250. The monthly total is around $2000 to $3000, with compute charges tracking actual usage.
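The arithmetic behind these examples can be checked with a short sketch. The rates are the approximate figures quoted in this section, not current list prices, and the function names are illustrative. The break-even calculation additionally assumes the $5000 cluster could hold the same 50 TB, which the examples do not state.

```python
# Approximate rates quoted in this section (not current list prices).
STORAGE_USD_PER_TB_MONTH = 20.0
SCAN_USD_PER_TB = 5.0
CLUSTER_USD_PER_MONTH = 5000.0


def scan_cost(tb_scanned: float) -> float:
    """On-demand compute cost of scanning the given number of TB."""
    return tb_scanned * SCAN_USD_PER_TB


def serverless_monthly_cost(stored_tb: float, scanned_tb: float) -> float:
    """Storage plus on-demand scan charges for one month."""
    return stored_tb * STORAGE_USD_PER_TB_MONTH + scan_cost(scanned_tb)


def break_even_scanned_tb(stored_tb: float, cluster_usd: float) -> float:
    """Monthly TB scanned at which serverless spend matches the fixed cluster bill."""
    return (cluster_usd - stored_tb * STORAGE_USD_PER_TB_MONTH) / SCAN_USD_PER_TB


print(scan_cost(10))    # the 10 TB query
print(scan_cost(100))   # the accidental 100 TB full scan

# Four weekly 50 TB training scans on top of 50 TB stored, before ad hoc queries.
weekly_scans_tb = 4 * 50
print(serverless_monthly_cost(stored_tb=50, scanned_tb=weekly_scans_tb))

# Scan volume at which the fixed cluster becomes the cheaper option.
print(break_even_scanned_tb(stored_tb=50, cluster_usd=CLUSTER_USD_PER_MONTH))
```

With these figures, storage plus the weekly scans come to $2000, leaving up to $1000 for ad hoc queries within the quoted range. The serverless bill would only reach the cluster's $5000 at about 800 TB scanned per month.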