Natural Language Processing Systems: Scalability (Model Parallelism, Batching)

What is Model Parallelism and Why Do We Need It?

Definition
ML Scalability is the ability to handle increasing load - more requests, larger models, bigger datasets - without proportionally increasing latency or cost. It covers both training (processing more data) and serving (handling more inference requests).

Why ML Scaling is Different

Traditional web services scale by adding more stateless servers. ML systems are fundamentally different. A model is a large stateful artifact - a 7B parameter LLM is 14GB in float16. You cannot simply spin up more servers; each server needs the model loaded in memory. Loading takes 30-60 seconds for large models. Memory is the constraint, not CPU.
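That memory footprint is just parameter count times bytes per parameter. A quick sketch, using decimal gigabytes and ignoring activations, optimizer state, and KV cache (which add more on top):

```python
def model_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights (decimal GB)."""
    return num_params * bytes_per_param / 1e9

# 7B parameters in float16 (2 bytes each) -> 14 GB of weights alone
print(model_memory_gb(7e9, 2))  # 14.0
```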

The scaling challenge varies by phase. Training scales with data parallelism: split your dataset across GPUs, each processes a portion, gradients are synchronized. Serving scales with model replication and request batching: run multiple model copies, batch incoming requests to maximize GPU utilization. These require different architectures.
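The data-parallel training loop can be sketched in pure Python: shard the data, compute a local gradient per worker, average the gradients (the all-reduce step that frameworks like PyTorch DDP do over NCCL), and apply one shared update. The toy least-squares model and learning rate below are illustrative assumptions, not from the source:

```python
def shard(dataset, num_workers):
    """Split the dataset into roughly equal contiguous shards, one per worker."""
    size = (len(dataset) + num_workers - 1) // num_workers
    return [dataset[i * size:(i + 1) * size] for i in range(num_workers)]

def local_gradient(shard_data, weight):
    """Toy MSE gradient for a 1-D model y = w * x, on one worker's shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard_data) / len(shard_data)

def all_reduce_mean(grads):
    """Average gradients across workers (the synchronization step)."""
    return sum(grads) / len(grads)

data = [(x, 3.0 * x) for x in range(1, 9)]   # ground truth: w = 3
w = 0.0
for _ in range(100):
    grads = [local_gradient(s, w) for s in shard(data, num_workers=4)]
    w -= 0.01 * all_reduce_mean(grads)
print(round(w, 2))  # 3.0 — all workers converge on the same weights
```

Every worker applies the same averaged gradient, so all model copies stay identical after each step, which is the defining property of data parallelism.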

The Cost of Not Scaling

An unscaled ML system hits walls fast. At 100 requests per second, your single GPU saturates. Queue depth grows. P99 latency spikes from 100ms to 5 seconds. Users abandon. Revenue drops. The fix is not obvious: more GPUs require load balancing, model synchronization, and batching logic that your original system was not designed for.
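The backlog arithmetic behind that failure mode is simple: once arrivals exceed the service rate, the queue grows linearly and latency grows with it. A sketch with an assumed single-GPU service rate of 40 RPS (the exact number depends on your model and hardware):

```python
def queue_depth_after(arrival_rps: float, service_rps: float, seconds: float) -> float:
    """Requests that pile up when arrivals exceed service capacity."""
    return max(0.0, (arrival_rps - service_rps) * seconds)

# 100 RPS arriving, one GPU serving 40 RPS: backlog after one minute
print(queue_depth_after(100, 40, 60))  # 3600.0
```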

💡 Key Insight: GPU utilization is the key metric for ML cost efficiency. A GPU processing one request at a time might run at 10% utilization. Batching 32 requests together can push utilization to 70-80%, serving roughly 30x more requests on the same hardware.
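The throughput claim can be sanity-checked with back-of-the-envelope math. The 100 ms and 120 ms latencies below are illustrative assumptions, chosen to reflect that a batched forward pass costs only slightly more than a single-request pass, because the single-request GPU was mostly idle:

```python
def throughput_rps(batch_size: int, batch_latency_s: float) -> float:
    """Requests per second when batch_size requests share one forward pass."""
    return batch_size / batch_latency_s

single = throughput_rps(1, 0.10)    # 1 request per 100 ms  -> 10 RPS
batched = throughput_rps(32, 0.12)  # 32 requests per 120 ms -> ~267 RPS
print(round(batched / single, 1))   # 26.7x more throughput on the same GPU
```

The per-request latency rises slightly (120 ms vs 100 ms), which is the usual batching trade-off: a small latency cost buys a large throughput gain.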

Scaling Dimensions

ML systems scale along three axes: model parallelism (splitting one model across multiple GPUs), data parallelism (running multiple model copies on different data), and request batching (processing multiple requests together). Each addresses different bottlenecks and they can be combined.
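On the serving side, the batching axis is usually implemented as dynamic batching: block until the first request arrives, then opportunistically drain whatever else is already queued, up to a size cap. A minimal sketch using Python's standard `queue` module (the cap and timeout values are arbitrary):

```python
import queue

def collect_batch(q: queue.Queue, max_batch_size: int, timeout_s: float) -> list:
    """Dynamic batching: wait for one request, then grab whatever else is
    already queued, up to max_batch_size, without waiting any further."""
    batch = [q.get(timeout=timeout_s)]    # block until at least one request
    while len(batch) < max_batch_size:
        try:
            batch.append(q.get_nowait())  # drain without extra waiting
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(40):
    q.put(f"req-{i}")
print(len(collect_batch(q, max_batch_size=32, timeout_s=0.05)))  # 32
print(len(collect_batch(q, max_batch_size=32, timeout_s=0.05)))  # 8
```

Production servers (e.g. NVIDIA Triton's dynamic batcher) add a short wait window to let partial batches fill, trading a few milliseconds of latency for higher utilization.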

💡 Key Takeaways
- ML scaling differs from web services because models are large stateful artifacts (14GB for 7B params) that take 30-60 seconds to load
- Training scales with data parallelism (split dataset across GPUs), serving scales with model replication and request batching
- GPU utilization is the key cost metric - single requests run at 10% utilization, batching 32 requests pushes to 70-80%
- Three scaling axes: model parallelism (split model), data parallelism (multiple copies), request batching (process together)
📌 Interview Tips
1. Explain why ML scaling is harder than web scaling: models are stateful 14GB artifacts, not stateless handlers. Memory is the constraint.
2. Mention GPU utilization as your cost efficiency metric. Single request processing wastes 90% of GPU capacity.
3. Distinguish training vs serving scaling: data parallelism for training, model replication with batching for serving.