
Production Deployment: From Training Cost to Serving Savings

In production machine learning systems, distillation sits between training and serving as a compression step that trades upfront compute for continuous serving cost reduction. The economic model is straightforward: you pay once to run a large teacher over your datasets and train the student, then save on every inference request.

Consider a text ranking service handling 10,000 requests per second with a 50 millisecond p95 latency requirement. A BERT-base model with 110 million parameters needs 60 to 120 milliseconds per request on a central processing unit (CPU) and 440 MB of memory, which forces you onto graphics processing unit (GPU) servers at roughly 10x the cost or misses the latency service level objective entirely. Distilling to a 66 million parameter student cuts latency to about 40 milliseconds, fitting comfortably within the budget on commodity CPUs. At 10,000 queries per second, that means roughly 400 CPU cores instead of 1,200, or avoiding 50+ GPU instances entirely. The one-time cost might be 500 GPU hours to generate soft targets over 100 million examples plus 200 GPU hours for student training, roughly $2,000 to $3,000 on cloud providers. Monthly serving savings run $20,000 to $50,000 depending on scale, paying back the investment in days.

On-device scenarios are even more constrained. Mobile keyboards budget under 20 milliseconds per keystroke and under 50 MB of model storage, including embeddings and runtime. Speech models that need 100+ MB and 200+ milliseconds on servers must compress to 20 to 50 MB and under 50 milliseconds on mobile chips with no GPU and strict power limits. Google has reported distilling server speech models to on-device students, keeping word error rate within 2 to 3 percent while meeting these budgets.

For large language models, a 7 billion parameter teacher needs over 14 GB of memory in half precision, requiring expensive inference servers. Black-box distillation to a 1 to 2 billion parameter student with acceptable quality can reduce memory to under 4 GB and latency by 3x to 5x, enabling deployment on lower-cost infrastructure.
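To make the cost arithmetic concrete, here is a minimal Python sketch of the serving-economics model described above. The latencies, throughput, and one-time training cost are the figures from this section; the per-core-hour price and the one-request-per-core serving assumption are illustrative placeholders, not real quotes.

```python
import math

# Back-of-the-envelope serving economics for the distillation example above
# (illustrative only). Latencies, QPS, and one-time training cost come from
# the text; the per-core-hour price is an assumed placeholder.

def cores_needed(qps: float, latency_ms: float) -> int:
    """Cores required if each core serves one request at a time."""
    per_core_qps = 1000.0 / latency_ms          # 40 ms -> 25 requests/s per core
    return math.ceil(qps / per_core_qps)

def monthly_core_cost(cores: int, usd_per_core_hour: float) -> float:
    """Cost of running a given number of cores for ~720 hours (one month)."""
    return cores * usd_per_core_hour * 24 * 30

QPS = 10_000
teacher_cores = cores_needed(QPS, latency_ms=120)   # ~1,200 cores
student_cores = cores_needed(QPS, latency_ms=40)    # ~400 cores

USD_PER_CORE_HOUR = 0.04                            # assumed cloud CPU price
monthly_savings = monthly_core_cost(teacher_cores - student_cores,
                                    USD_PER_CORE_HOUR)

ONE_TIME_COST = 3_000                               # soft targets + student training
payback_days = ONE_TIME_COST / (monthly_savings / 30)

print(f"teacher: {teacher_cores} cores, student: {student_cores} cores")
print(f"monthly savings: ${monthly_savings:,.0f}, payback: {payback_days:.1f} days")
```

Under these assumptions the savings land in the low tens of thousands of dollars per month and the payback period is a few days, consistent with the ranges quoted above.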
💡 Key Takeaways
Economics favor distillation at scale: a $2,000 to $3,000 one-time training cost versus $20,000 to $50,000 in monthly serving savings at 10,000 queries per second
CPU serving becomes viable: a 40 millisecond student versus a roughly 100 millisecond teacher eliminates the need for 50+ GPU instances, cutting infrastructure cost by about 10x
On-device constraints are extreme: under 20 millisecond latency, under 50 MB storage, no GPU, and tight power budgets demand 5x to 10x compression from server models
Distilling a 7 billion parameter large language model to a 1 to 2 billion parameter student cuts memory from over 14 GB to under 4 GB and inference latency by 3x to 5x, enabling lower-tier deployment (see the sketch after this list)
Payback period is days to weeks for high-throughput services, making distillation cost effective once serving volume exceeds a few hundred queries per second
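The memory numbers in the large language model takeaway follow directly from parameter count times bytes per parameter. The sketch below checks the weight-only footprint; parameter counts are from the text, while the precision table and helper name are illustrative assumptions, and KV cache and activations are ignored.

```python
# Weight-only memory check for the LLM figures above (KV cache and
# activations ignored). Parameter counts are from the text; the precision
# table and helper name are illustrative assumptions.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(params_billions: float, dtype: str = "fp16") -> float:
    """Model weight footprint in GB: params * bytes per param, in billions."""
    return params_billions * BYTES_PER_PARAM[dtype]

print(weight_memory_gb(7))   # 14.0 GB -- the 7B teacher in half precision
print(weight_memory_gb(2))   #  4.0 GB -- a 2B student fits lower-cost hardware
print(weight_memory_gb(1))   #  2.0 GB
```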
📌 Examples
Meta text ranking pipeline: distill a 12 layer transformer teacher into a 6 layer student, hitting 45 millisecond p95 on CPU versus 110 milliseconds for the teacher and shrinking the fleet from 800 to 400 servers, saving $30,000 monthly
Google on-device speech recognition: compress a 150 MB server model to a 30 MB mobile model, keep word error rate within 2.5 percent, and enable real-time dictation under 50 millisecond latency on phone chips
OpenAI safety classifier distilled from a large language model API: train a 400 million parameter student on 1 million prompt-response pairs, reaching 95 percent of the teacher's F1 score at 20x lower serving cost