Hard vs Soft Parameter Sharing Strategies
Hard parameter sharing is the dominant production approach. All tasks share a single feature extractor with identical weights, and only the final output heads differ. If the shared backbone has 10 million parameters and each of three task heads has 50,000, the total model size is 10.15 million parameters; running three separate models would require roughly 30 million. This near-3x reduction translates directly into memory footprint, cache efficiency, and serving cost.
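A minimal PyTorch sketch of hard sharing, assuming illustrative layer sizes and hypothetical task names (none of these dimensions come from the text):

```python
import torch
import torch.nn as nn

class HardSharedModel(nn.Module):
    """One shared backbone, one small head per task (sizes are illustrative)."""
    def __init__(self, in_dim=512, hidden=1024, task_names=("ctr", "watch_time", "like")):
        super().__init__()
        # Shared feature extractor: every task reads the same representation.
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Task-specific heads: only these weights differ between tasks.
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, 1) for t in task_names})

    def forward(self, x):
        shared = self.backbone(x)  # computed once per request, reused by every head
        return {t: head(shared) for t, head in self.heads.items()}

model = HardSharedModel()
outputs = model(torch.randn(32, 512))  # dict of per-task logits
```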
Soft parameter sharing keeps a separate model per task but constrains them to remain similar through regularization penalties or lateral connections called cross-stitch units. Each task has its own backbone, and a regularization term penalizes the L2 distance between corresponding layer weights across tasks. This is useful when tasks are related but not identical. For example, predicting CTR and predicting video completion rate both benefit from user-preference modeling, but the video task needs temporal features that ad CTR does not, and forcing identical features can hurt both tasks.
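A sketch of the regularization flavor of soft sharing, assuming two hypothetical per-task backbones and an illustrative penalty weight; cross-stitch units would replace this penalty with learned lateral connections between the backbones:

```python
import torch
import torch.nn as nn

def make_backbone(in_dim=512, hidden=1024):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU())

# One backbone per task; weights are independent but pulled toward each other.
backbones = nn.ModuleDict({"ctr": make_backbone(), "completion": make_backbone()})

def soft_sharing_penalty(backbones, weight=1e-3):
    """Sum of squared L2 distances between corresponding parameters of each backbone pair."""
    names = list(backbones.keys())
    penalty = torch.tensor(0.0)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            for p_a, p_b in zip(backbones[names[i]].parameters(),
                                backbones[names[j]].parameters()):
                penalty = penalty + (p_a - p_b).pow(2).sum()
    return weight * penalty

# During training: total_loss = sum(per_task_losses) + soft_sharing_penalty(backbones)
```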
The trade-off is clear. Hard sharing minimizes compute and memory, making it the default choice when tasks are closely aligned and serving latency is critical; at 200,000 requests per second, every extra millisecond of latency translates into additional serving capacity. Soft sharing provides more modeling flexibility and can prevent negative transfer when task alignment is uncertain, but it costs 2 to 3 times more in serving resources. Production systems typically start with hard sharing and move to soft sharing or mixture-of-experts only when hard sharing demonstrates negative transfer.
Advanced architectures combine both. Multi-gate mixture-of-experts (MMoE) uses hard-shared experts but routes different tasks to different subsets of experts through learned gates. This provides task-specific capacity while still sharing computation. Google reported that MMoE improved YouTube recommendation metrics by allowing watch-time and engagement tasks to use different expert subsets while sharing the base representations.
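A simplified MMoE sketch, assuming four experts and two hypothetical tasks; the expert count, layer sizes, and task names are illustrative, not taken from the YouTube system:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMoE(nn.Module):
    """Shared experts plus a learned softmax gate per task (sizes are illustrative)."""
    def __init__(self, in_dim=512, expert_dim=256, n_experts=4,
                 task_names=("watch_time", "engagement")):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(n_experts)]
        )
        # One gate per task maps the input to a distribution over experts.
        self.gates = nn.ModuleDict({t: nn.Linear(in_dim, n_experts) for t in task_names})
        self.heads = nn.ModuleDict({t: nn.Linear(expert_dim, 1) for t in task_names})

    def forward(self, x):
        # Expert outputs are computed once and shared by every task.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, expert_dim)
        outputs = {}
        for t in self.gates:
            w = F.softmax(self.gates[t](x), dim=-1).unsqueeze(-1)      # (B, E, 1)
            mixed = (w * expert_out).sum(dim=1)                        # task-specific mixture
            outputs[t] = self.heads[t](mixed)
        return outputs

logits = MMoE()(torch.randn(32, 512))  # {"watch_time": ..., "engagement": ...}
```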
💡 Key Takeaways
•Hard sharing uses one shared backbone for all tasks, reducing model size by 2 to 3 times and minimizing serving latency
•Soft sharing maintains separate backbones per task with regularization to keep weights similar, providing flexibility when tasks need different features
•Production systems default to hard sharing for serving efficiency, only moving to soft sharing when negative transfer is proven
•Mixture of experts (MMoE) combines hard and soft by routing tasks to different expert subsets while sharing base computation
•Memory and compute trade-offs are significant: hard sharing at 10.1 million parameters versus soft sharing at 20.1 million parameters in a two-task example
📌 Examples
YouTube recommendation MMoE: Shared expert layers with per-task gates, allowing the watch-time task to use different experts than the engagement task, improving overall AUC by 0.5%
Uber Eats restaurant ranking: Hard sharing for delivery time and click predictions (closely aligned), separate models for cuisine preference (different feature space)
Tesla vision: Hard sharing across detection, depth, segmentation on same camera frames, fits 200MB model within embedded GPU memory