
Production Deployment: From Training Cost to Serving Savings

TRAINING COST VS SERVING SAVINGS

Distillation has an upfront cost: running the teacher over the training data and training the student for many epochs. A typical distillation run might cost $1,000 to $10,000 in compute. But serving savings accumulate: if the student saves $0.01 per 1,000 requests and you serve 10 million requests per day, that is $100 per day. The break-even point might be 10 to 100 days. Calculate this before starting.
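The arithmetic above is worth making explicit. A minimal sketch of the break-even calculation (the function name and numbers are illustrative, taken from the figures in this section):

```python
def break_even_days(training_cost_usd: float,
                    savings_per_1k_requests_usd: float,
                    daily_requests: int) -> float:
    """Days until accumulated serving savings repay the one-time distillation cost."""
    daily_savings = savings_per_1k_requests_usd * daily_requests / 1_000
    return training_cost_usd / daily_savings

# Numbers from the text: $5K training run, $0.01 saved per 1K requests,
# 10M requests/day -> $100/day in savings -> 50-day payback.
print(break_even_days(5_000, 0.01, 10_000_000))  # 50.0
```

Running the calculation across your realistic request-volume range (not just the peak) gives a payback window rather than a single number, which is usually what the decision needs.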

LATENCY IMPROVEMENTS

Beyond cost, smaller models are faster. A 10x smaller model typically has 3 to 5x lower latency (not 10x, because memory bandwidth and per-request overhead dominate at small sizes). For real-time applications with 50 ms latency budgets, distillation can be the difference between feasible and infeasible. Measure actual latency, not just parameter count.
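"Measure actual latency" can be sketched as a small timing harness; `predict` and `batch` here are placeholders for your model's inference call and a representative input, not part of any specific framework:

```python
import time
import statistics

def measure_latency_ms(predict, batch, warmup=10, iters=100):
    """Wall-clock latency percentiles for a single inference call.

    `predict` is any callable (your model's forward pass); `batch` is a
    representative input. Warmup runs exclude cache/JIT effects from timing.
    """
    for _ in range(warmup):
        predict(batch)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        predict(batch)
        samples.append((time.perf_counter() - t0) * 1_000)  # ms
    samples.sort()
    return {"p50": statistics.median(samples),
            "p95": samples[int(0.95 * len(samples)) - 1]}
```

Compare teacher and student with the same batch shapes and hardware you will serve on; tail latency (p95) is usually what a 50 ms budget is really about, not the mean.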

DEPLOYMENT ARCHITECTURE

After distillation, the teacher is discarded for serving; only the student is deployed. This simplifies infrastructure: no need for large GPU instances, easier horizontal scaling, simpler failure handling. Some systems keep the teacher for occasional quality checks or for retraining the student when the data distribution shifts.

Best Practice: Monitor student quality in production. If degradation exceeds 2 to 3 percent, investigate whether data drift requires re-distillation with updated teacher predictions.
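The monitoring check above reduces to a small gate. A sketch, assuming degradation is measured as a relative drop against a baseline quality metric (the function name and 2.5% default threshold are illustrative choices, not a standard API):

```python
def needs_redistillation(baseline_quality: float,
                         current_quality: float,
                         threshold: float = 0.025) -> bool:
    """Flag re-distillation when relative quality drop exceeds ~2-3%.

    Relative (not absolute) drop is assumed here; pick whichever convention
    your quality metric uses and apply it consistently.
    """
    drop = (baseline_quality - current_quality) / baseline_quality
    return drop > threshold

# 90% -> 86% accuracy is a ~4.4% relative drop: investigate data drift.
print(needs_redistillation(0.90, 0.86))  # True
print(needs_redistillation(0.90, 0.89))  # False
```

In practice this runs on a periodic eval set scored by both the logged baseline and the live student, so a triggered flag points at drift rather than at eval noise.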

ITERATIVE DISTILLATION

For extreme compression (100x smaller), distill in stages. First distill from 1B to 300M parameters. Then distill from 300M to 100M. Then from 100M to 30M. Each stage preserves more knowledge than jumping directly from 1B to 30M. The intermediate students act as a curriculum, providing easier targets than the original teacher.
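The staged pipeline is just a loop in which each student becomes the next stage's teacher. A sketch, where `distill(teacher, size)` stands in for a full distillation training run (a hypothetical helper, not a real library call):

```python
# 1B -> 300M -> 100M -> 30M, per the staging described above.
STAGES = [300_000_000, 100_000_000, 30_000_000]

def iterative_distill(teacher, distill, stages=STAGES):
    """Chain distillation stages: each trained student teaches the next one.

    `distill(teacher_model, target_size)` is assumed to run a complete
    distillation training job and return the trained student.
    """
    model = teacher
    for size in stages:
        model = distill(model, size)  # previous student is the new teacher
    return model
```

Because only adjacent stages interact, each run can reuse the same training loop and data pipeline; only the teacher checkpoint and student architecture change between stages.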

💡 Key Takeaways
Distillation upfront cost $1K-$10K; savings $0.01/1K requests; break-even in 10-100 days at scale
10x smaller model = 3-5x lower latency (not 10x due to memory bandwidth overhead)
Teacher discarded after training; only student deployed, simplifying infrastructure
Monitor student quality: >2-3% degradation suggests re-distillation needed for data drift
Iterative distillation: 1B to 300M to 100M to 30M preserves more than direct 1B to 30M
📌 Interview Tips
1. Calculate break-even: $5K training, $0.01/1K savings, 10M daily requests = $100/day = 50-day payback
2. Discuss latency: 10x smaller but only 4x faster due to overhead; measure, do not assume
3. Explain staged distillation: each step is easier, intermediate models act as curriculum