Production Deployment: From Training Cost to Serving Savings
TRAINING COST VS SERVING SAVINGS
Distillation has upfront costs: running the teacher over the training data and training the student for many epochs. A typical distillation run might cost $1,000 to $10,000 in compute. But serving savings accumulate: if the student saves $0.01 per 1,000 requests and you serve 10 million requests per day, you save $100 per day. The break-even point might be 10 to 100 days. Calculate this before starting.
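The arithmetic above can be sketched as a small calculator. The function name and parameters are illustrative, not from any library, and the dollar figures are the hypothetical ones from the text:

```python
def break_even_days(training_cost_usd, savings_per_1k_requests_usd, requests_per_day):
    """Days of serving needed to recoup the one-time distillation cost."""
    daily_savings = savings_per_1k_requests_usd * requests_per_day / 1_000
    return training_cost_usd / daily_savings

# $0.01 saved per 1,000 requests at 10M requests/day -> $100/day in savings.
print(break_even_days(1_000, 0.01, 10_000_000))   # -> 10.0 days
print(break_even_days(10_000, 0.01, 10_000_000))  # -> 100.0 days
```

Running this with your own numbers before committing to a distillation project makes the go/no-go decision explicit.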
LATENCY IMPROVEMENTS
Beyond cost, smaller models are faster. A 10x smaller model typically has 3 to 5x lower latency (not 10x, because memory bandwidth and per-request overhead dominate at small sizes). For real-time applications with 50 ms latency budgets, distillation can be the difference between feasibility and impossibility. Measure actual latency, not just parameter count.
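"Measure actual latency" can be as simple as wall-clock timing with warmup and a tail percentile. This is a minimal sketch: `infer` is a stand-in for your model's forward pass (an assumption, not a real API), and the helper name is hypothetical:

```python
import time
import statistics

def latency_ms(infer, n_warmup=10, n_runs=100):
    """Time a zero-arg callable; return (median, ~p99) latency in ms."""
    for _ in range(n_warmup):          # warm caches/JIT before timing
        infer()
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - t0) * 1_000)
    samples.sort()
    return statistics.median(samples), samples[int(0.99 * n_runs) - 1]

# Toy stand-in for a model forward pass.
p50, p99 = latency_ms(lambda: sum(range(10_000)))
```

Comparing p50 and p99 of teacher and student under production batch sizes tells you whether the student actually fits a 50 ms budget; parameter counts alone do not.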
DEPLOYMENT ARCHITECTURE
After distillation, the teacher is discarded for serving. Only the student is deployed. This simplifies infrastructure: no need for large GPU instances, easier to scale horizontally, simpler failure handling. Some systems keep the teacher for occasional quality checks or for retraining the student when data distributions change.
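For systems that keep the teacher around for quality checks, a periodic offline job can sample production inputs and measure teacher/student agreement. A minimal sketch, assuming `teacher_fn` and `student_fn` are your own hypothetical inference callables:

```python
import random

def spot_check_agreement(teacher_fn, student_fn, inputs, sample_size=100, seed=0):
    """Offline quality check: fraction of sampled inputs where the
    student's output matches the retained teacher's output."""
    rng = random.Random(seed)
    sample = rng.sample(inputs, min(sample_size, len(inputs)))
    matches = sum(teacher_fn(x) == student_fn(x) for x in sample)
    return matches / len(sample)

# Toy stand-ins: a "teacher" and a student that disagrees on some inputs.
teacher = lambda x: x % 3
student = lambda x: x % 3 if x % 10 else 0
rate = spot_check_agreement(teacher, student, list(range(1_000)))
```

A falling agreement rate over time is a cheap signal that the data distribution has drifted and the student may need retraining.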
ITERATIVE DISTILLATION
For extreme compression (100x smaller), distill in stages: first from 1B to 300M parameters, then from 300M to 100M, then from 100M to 30M. Each stage preserves more knowledge than jumping directly from 1B to 30M. The intermediate students act as a curriculum, providing easier targets than the original teacher.
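The staging loop itself is simple: each student becomes the teacher for the next stage. This sketch only records the lineage; `train_student` is a hypothetical stand-in for a full distillation run, not a real training API:

```python
def train_student(teacher, student_size):
    # Stand-in: a real implementation would train a student_size model
    # on the teacher's outputs. Here we just record which teacher it saw.
    return {"size": student_size, "teacher_size": teacher["size"]}

def staged_distillation(initial_teacher, stage_sizes):
    """Chain distillation stages; each stage's student teaches the next."""
    teacher = initial_teacher
    for size in stage_sizes:           # e.g. 1B -> 300M -> 100M -> 30M
        teacher = train_student(teacher, size)
    return teacher

final = staged_distillation({"size": 1_000_000_000},
                            [300_000_000, 100_000_000, 30_000_000])
```

Note that the 30M student's immediate teacher is the 100M student, not the 1B original, which is exactly the curriculum effect described above.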