
Production Implementation and Serving Architecture

Production multi-task serving executes one shared forward pass, then runs task-specific heads in parallel or in sequence depending on their dependencies. The shared encoder dominates cost, typically 80 to 90% of total inference time. For example, a BERT-style transformer encoder might take 12 milliseconds while four lightweight fully connected heads take about 1 millisecond each. Total latency is roughly 13 milliseconds with the heads run in parallel (16 milliseconds even if they run sequentially), which meets a 20 millisecond p99 Service Level Objective (SLO). Four separate models at 12 milliseconds each would cost roughly 48 milliseconds back to back, or quadruple the serving hardware if run in parallel, and violate the SLO.

Infrastructure must handle partial failures and timeouts per head. Set individual head timeouts, often 1 to 2 milliseconds, and return default predictions if a head times out. This prevents one slow head from blocking the entire response (see the serving sketch below).

Cache heavy computations, such as user embeddings that are shared across all heads. At 100,000 QPS, caching user embeddings for 100 milliseconds can save thousands of encoder calls per second. Quantization and mixed-precision inference further reduce memory and accelerator cost: INT8 quantization typically cuts model size and latency by 2 to 3 times with less than 1% accuracy loss.

Training pipelines must align labels that arrive with different delays. Click labels are available within seconds; conversion labels arrive hours or days later. Build per-task label pipelines with strict event-time tracking. During training, sample complete records per task, ensuring feature cutoff times match label collection times to prevent leakage. Use asynchronous label updates and replay windows for delayed labels. For example, retrain daily with a 7-day window so conversion labels from days 1 through 6 can still update model parameters.

Monitoring splits metrics by task and by user segment. Track per-task Area Under the Curve (AUC), calibration error, and prediction distribution: a win on overall utility can hide a regression on a minority task or demographic slice. Set per-task Service Level Agreements (SLAs) and trigger automatic rollback if any task degrades beyond its threshold. Large teams use traffic shadowing and canary rollouts, sending 1% of traffic to the new model and comparing per-task metrics before full deployment.
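As a rough sketch of the fan-out pattern above, assuming a long-lived thread pool and placeholder encoder/head functions (the names `shared_encoder`, `HEADS`, and `DEFAULTS` are illustrative, not a real serving framework), per-head timeouts with fallback predictions might look like this:

```python
"""Minimal sketch: one shared encoder pass, then task heads run in parallel
with per-head timeouts and default fallbacks on timeout."""
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

HEAD_TIMEOUT_S = 0.002                      # per-head budget (1-2 ms in the text)
DEFAULTS = {"ctr": 0.01, "cvr": 0.001, "quality": 0.5}   # fallbacks on timeout
POOL = ThreadPoolExecutor(max_workers=8)    # long-lived pool, not created per request

def shared_encoder(features):
    time.sleep(0.012)                       # stand-in for the ~12 ms encoder pass
    return [0.1 * f for f in features]

def make_head(weight):
    def head(embedding):                    # stand-in for a ~1 ms task head
        return sum(weight * x for x in embedding)
    return head

HEADS = {"ctr": make_head(0.3), "cvr": make_head(0.1), "quality": make_head(0.7)}

def serve(features):
    embedding = shared_encoder(features)    # paid once, shared by every head
    futures = {name: POOL.submit(head, embedding) for name, head in HEADS.items()}
    results = {}
    for name, fut in futures.items():
        try:
            results[name] = fut.result(timeout=HEAD_TIMEOUT_S)
        except TimeoutError:
            results[name] = DEFAULTS[name]  # slow head falls back; response still returns
    return results

print(serve([1.0, 2.0, 3.0]))
```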
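The embedding cache can be as simple as a time-to-live (TTL) map keyed by user. This is a minimal sketch assuming the 100 millisecond TTL from the example; the cache class, key scheme, and encoder stand-in are hypothetical:

```python
"""Short-TTL cache for shared user embeddings."""
import time

class TTLCache:
    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._store = {}                          # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                       # fresh hit: skip the encoder call
        value = compute()                         # miss or stale: pay for the encoder
        self._store[key] = (now + self.ttl_s, value)
        return value

def expensive_user_encoder(user_id):
    time.sleep(0.012)                             # stand-in for the shared encoder pass
    return [hash((user_id, i)) % 100 / 100.0 for i in range(4)]

embedding_cache = TTLCache(ttl_s=0.1)             # 100 ms TTL

def user_embedding(user_id):
    # At ~100k QPS, repeated requests for the same user inside the TTL reuse
    # the cached embedding instead of re-running the encoder.
    return embedding_cache.get_or_compute(user_id, lambda: expensive_user_encoder(user_id))

print(user_embedding("u42"))
print(user_embedding("u42"))                      # second call within 100 ms: cache hit
```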
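One common way to get the INT8 savings mentioned above is post-training dynamic quantization, for example with PyTorch. The toy two-layer model below stands in for the real shared encoder plus heads; it is a sketch, not the article's actual model:

```python
"""Post-training dynamic INT8 quantization of Linear layers with PyTorch."""
import torch
import torch.nn as nn

model = nn.Sequential(            # placeholder for the real multi-task model
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, 4),            # e.g. four task logits
)

# Quantize Linear weights to INT8; activations stay float and are quantized
# dynamically at runtime, typically shrinking size and latency by 2-3x.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(quantized(x).shape)         # same interface, smaller/faster Linear kernels
```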
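For the label-alignment step, a per-task join with an event-time leakage guard and replay window might look like the following sketch. The record fields (`request_id`, `cutoff_time`, `event_time`) and the 7-day window are illustrative:

```python
"""Join feature snapshots with delayed labels, enforcing event-time cutoffs."""
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FeatureSnapshot:
    request_id: str
    features: dict
    cutoff_time: datetime          # features computed with data up to this time

@dataclass
class Label:
    request_id: str
    task: str                      # "click" arrives in seconds, "conversion" in days
    value: float
    event_time: datetime

REPLAY_WINDOW = timedelta(days=7)  # delayed labels can still update the model

def join_examples(snapshots, labels, train_time):
    snap_by_id = {s.request_id: s for s in snapshots}
    for label in labels:
        snap = snap_by_id.get(label.request_id)
        if snap is None:
            continue
        if label.event_time <= snap.cutoff_time:
            continue               # leakage guard: features must not peek past the label
        if train_time - label.event_time > REPLAY_WINDOW:
            continue               # label too old for this retraining window
        yield snap.features, label.task, label.value

now = datetime(2024, 1, 8)
snaps = [FeatureSnapshot("r1", {"user_ctr": 0.02}, datetime(2024, 1, 2, 12, 0))]
labels = [Label("r1", "conversion", 1.0, datetime(2024, 1, 4, 9, 0))]
print(list(join_examples(snaps, labels, now)))
```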
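Finally, the per-task rollback check during a canary can be expressed as a simple comparison of baseline and canary metrics against per-task thresholds. Metric names and threshold values here are illustrative, not the article's production numbers:

```python
"""Per-task canary check: roll back if any task regresses past its threshold."""

TASK_THRESHOLDS = {                 # max tolerated per-task regression
    "ctr":        {"auc_drop": 0.002, "calib_err": 0.01},
    "conversion": {"auc_drop": 0.005, "calib_err": 0.02},
}

def should_rollback(baseline, canary):
    """baseline/canary: {task: {"auc": float, "calibration_error": float}}"""
    reasons = []
    for task, limits in TASK_THRESHOLDS.items():
        auc_drop = baseline[task]["auc"] - canary[task]["auc"]
        calib = canary[task]["calibration_error"]
        if auc_drop > limits["auc_drop"]:
            reasons.append(f"{task}: AUC dropped by {auc_drop:.4f}")
        if calib > limits["calib_err"]:
            reasons.append(f"{task}: calibration error {calib:.4f}")
    return reasons                   # non-empty list -> roll the canary back

baseline = {"ctr": {"auc": 0.810, "calibration_error": 0.008},
            "conversion": {"auc": 0.740, "calibration_error": 0.015}}
canary   = {"ctr": {"auc": 0.805, "calibration_error": 0.009},
            "conversion": {"auc": 0.732, "calibration_error": 0.021}}
print(should_rollback(baseline, canary))
```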
💡 Key Takeaways
The shared encoder takes 80 to 90% of inference time (12 ms in the example), while lightweight heads add 1 to 2 milliseconds each
Parallel head execution with per-head timeouts (1 to 2 ms) and fallback predictions prevents one slow head from blocking the response
Caching user embeddings for 100 milliseconds can save thousands of encoder forward passes per second at high QPS
Label alignment across tasks requires strict event-time tracking: click labels arrive within seconds, while conversion labels are delayed by hours to days
Per-task Service Level Agreements (SLAs) with automatic rollback prevent silent regressions on minority tasks during deployment
📌 Examples
Google ad serving: 150ms p99 SLO across CTR, CVR, and quality-score heads, per-head timeout at 20ms with last-known-good fallback
Uber dispatch: Multi-task model predicts ETA, surge, and driver acceptance in 30ms; quantizing to INT8 cuts latency from 50ms to 30ms with a 0.8% AUC drop
Meta News Feed ranking: Canary rollout to 1% of traffic, per-task AUC and calibration checked every 5 minutes, auto rollback if engagement AUC drops by more than 0.02