
Production Deployment: From Training Cost to Serving Savings

In production machine learning systems, distillation sits between training and serving as a compression step that trades upfront compute for continuous serving cost reduction. The economic model is straightforward: you pay once to run a large teacher over your datasets and train the student, then save on every inference request.

Consider a text ranking service handling 10,000 requests per second with a 50 millisecond p95 latency requirement. A BERT-base model with 110 million parameters needs 60 to 120 milliseconds per request on a central processing unit (CPU) and 440 MB of memory, which forces you onto graphics processing unit (GPU) servers at roughly 10x the cost or misses the latency service level objective entirely. Distilling to a 66 million parameter student cuts latency to about 40 milliseconds, fitting comfortably within the budget on commodity CPUs. At 10,000 queries per second, that means roughly 400 CPU cores instead of 1,200, or avoiding 50+ GPU instances entirely. The one-time cost might be 500 GPU hours to generate soft targets over 100 million examples plus 200 GPU hours for student training, roughly $2,000 to $3,000 on cloud providers. Monthly serving savings run $20,000 to $50,000 depending on scale, paying back the investment in days.

On-device scenarios are even more constrained. Mobile keyboards budget under 20 milliseconds per keystroke and under 50 MB of model storage, including embeddings and runtime. Speech models that need 100+ MB and 200+ milliseconds on servers must compress to 20 to 50 MB and under 50 milliseconds on mobile chips with no GPU and strict power limits. Google has reported distilling server speech models to on-device students, keeping word error rate within 2 to 3 percent while meeting these budgets.

For large language models, a 7 billion parameter teacher needs over 14 GB of memory in half precision, requiring expensive inference servers. Black-box distillation to a 1 to 2 billion parameter student with acceptable quality can reduce memory to under 4 GB and latency by 3x to 5x, enabling deployment on lower-cost infrastructure.
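To make the cost arithmetic concrete, here is a minimal Python sketch of the serving-economics model described above. The latencies, throughput, and one-time training cost are the figures from this section; the per-core-hour price and the one-request-per-core serving assumption are illustrative placeholders, not real quotes.

```python
import math

# Back-of-the-envelope serving economics for the distillation example above
# (illustrative only). Latencies, QPS, and one-time training cost come from
# the text; the per-core-hour price is an assumed placeholder.

def cores_needed(qps: float, latency_ms: float) -> int:
    """Cores required if each core serves one request at a time."""
    per_core_qps = 1000.0 / latency_ms          # 40 ms -> 25 requests/s per core
    return math.ceil(qps / per_core_qps)

def monthly_core_cost(cores: int, usd_per_core_hour: float) -> float:
    """Cost of running a given number of cores for ~720 hours (one month)."""
    return cores * usd_per_core_hour * 24 * 30

QPS = 10_000
teacher_cores = cores_needed(QPS, latency_ms=120)   # ~1,200 cores
student_cores = cores_needed(QPS, latency_ms=40)    # ~400 cores

USD_PER_CORE_HOUR = 0.04                            # assumed cloud CPU price
monthly_savings = monthly_core_cost(teacher_cores - student_cores,
                                    USD_PER_CORE_HOUR)

ONE_TIME_COST = 3_000                               # soft targets + student training
payback_days = ONE_TIME_COST / (monthly_savings / 30)

print(f"teacher: {teacher_cores} cores, student: {student_cores} cores")
print(f"monthly savings: ${monthly_savings:,.0f}, payback: {payback_days:.1f} days")
```

Under these assumptions the savings land in the low tens of thousands of dollars per month and the payback period is a few days, consistent with the ranges quoted above.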
💡 Key Takeaways
Economics favor distillation at scale: a $2,000 to $3,000 one-time training cost versus $20,000 to $50,000 in monthly serving savings at 10,000 queries per second
CPU serving becomes viable: a 40 millisecond student versus a roughly 100 millisecond teacher eliminates the need for 50+ GPU instances, cutting infrastructure cost by about 10x
On-device constraints are extreme: under 20 millisecond latency, under 50 MB storage, no GPU, and tight power budgets demand 5x to 10x compression from server models
Distilling a 7 billion parameter large language model to a 1 to 2 billion parameter student cuts memory from over 14 GB to under 4 GB and inference latency by 3x to 5x, enabling lower-tier deployment (see the sketch after this list)
Payback period is days to weeks for high-throughput services, making distillation cost effective once serving volume exceeds a few hundred queries per second
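The memory numbers in the large language model takeaway follow directly from parameter count times bytes per parameter. The sketch below checks the weight-only footprint; parameter counts are from the text, while the precision table and helper name are illustrative assumptions, and KV cache and activations are ignored.

```python
# Weight-only memory check for the LLM figures above (KV cache and
# activations ignored). Parameter counts are from the text; the precision
# table and helper name are illustrative assumptions.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(params_billions: float, dtype: str = "fp16") -> float:
    """Model weight footprint in GB: params * bytes per param, in billions."""
    return params_billions * BYTES_PER_PARAM[dtype]

print(weight_memory_gb(7))   # 14.0 GB -- the 7B teacher in half precision
print(weight_memory_gb(2))   #  4.0 GB -- a 2B student fits lower-cost hardware
print(weight_memory_gb(1))   #  2.0 GB
```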
📌 Examples
Meta text ranking pipeline: distill a 12 layer transformer teacher into a 6 layer student, hitting 45 millisecond p95 on CPU versus 110 milliseconds for the teacher and shrinking the fleet from 800 to 400 servers, saving $30,000 monthly
Google on-device speech recognition: compress a 150 MB server model to a 30 MB mobile model, keep word error rate within 2.5 percent, and enable real-time dictation under 50 millisecond latency on phone chips
OpenAI safety classifier distilled from a large language model API: train a 400 million parameter student on 1 million prompt-response pairs, reaching 95 percent of the teacher's F1 score at 20x lower serving cost