What is Model Parallelism and Why Do We Need It?
Why ML Scaling is Different
Traditional web services scale by adding more stateless servers. ML systems are fundamentally different: a model is a large stateful artifact. A 7B-parameter LLM occupies 14GB in float16 (2 bytes per parameter). You cannot simply spin up more servers, because each server needs the full model loaded in memory, and loading takes 30-60 seconds for large models. Memory, not CPU, is the constraint.
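The weight-memory arithmetic above is simple enough to sketch directly. A minimal back-of-envelope helper, where the 7B parameter count is illustrative and the dtype sizes are the standard ones (float32 = 4 bytes, float16 = 2, int8 = 1):

```python
# Back-of-envelope estimate of the memory needed to hold a model's weights.
# This counts weights only; activations, KV cache, and optimizer state add more.
DTYPE_BYTES = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB."""
    return num_params * DTYPE_BYTES[dtype] / 1e9

print(weight_memory_gb(7e9, "float16"))  # 14.0, matching the figure above
```

The same function shows why quantization matters: the same 7B model drops to 7GB in int8.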
The scaling challenge varies by phase. Training scales with data parallelism: split your dataset across GPUs, each processes a portion, gradients are synchronized. Serving scales with model replication and request batching: run multiple model copies, batch incoming requests to maximize GPU utilization. These require different architectures.
The Cost of Not Scaling
An unscaled ML system hits walls fast. At 100 requests per second, your single GPU saturates. Queue depth grows. P99 latency spikes from 100ms to 5 seconds. Users abandon. Revenue drops. The fix is not obvious: more GPUs require load balancing, model synchronization, and batching logic that your original system was not designed for.
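The queue-depth blowup described above follows from simple arithmetic: once requests arrive faster than one GPU can serve them, the backlog grows every second without bound. A minimal simulation with illustrative (not measured) rates:

```python
def queue_depth_over_time(arrival_rps: float, service_rps: float, seconds: int):
    """Queue depth at each second, assuming constant rates and an empty start."""
    depth, history = 0.0, []
    for _ in range(seconds):
        depth = max(0.0, depth + arrival_rps - service_rps)
        history.append(depth)
    return history

# 100 req/s arriving against a GPU that serves 80 req/s: 20 requests pile up per second.
print(queue_depth_over_time(100, 80, 5))  # [20.0, 40.0, 60.0, 80.0, 100.0]
```

Every queued request waits behind everything ahead of it, which is exactly how P99 latency climbs from milliseconds to seconds.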
Scaling Dimensions
ML systems scale along three axes: model parallelism (splitting one model across multiple GPUs), data parallelism (running multiple model copies on different data), and request batching (processing multiple requests together). Each addresses different bottlenecks and they can be combined.
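Of the three axes, request batching is the easiest to sketch in isolation: collect pending requests and run them through the model together, trading a little per-request latency for much higher GPU utilization. A minimal sketch with an illustrative queue and batch size (real servers also add a timeout so small batches are not held indefinitely):

```python
from collections import deque

def drain_batches(queue: deque, max_batch_size: int):
    """Group queued requests into batches of at most max_batch_size."""
    batches = []
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch_size, len(queue)))]
        batches.append(batch)
    return batches

requests = deque(range(10))
print(drain_batches(requests, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Ten queued requests become three GPU invocations instead of ten, which is the utilization win batching is after.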