How MobileNet Achieves 8x Faster Inference with Depthwise Separable Convolutions
THE DEPTHWISE SEPARABLE TRICK
Standard convolution applies a 3×3 filter across all input channels simultaneously. For 64 input channels and 128 output channels, this requires 3×3×64×128 = 73,728 parameters per layer. Depthwise separable convolution splits this into two steps: (1) apply a separate 3×3 filter to each input channel (64×9 = 576 parameters), (2) use a 1×1 convolution to combine channels (64×128 = 8,192 parameters). Total: 8,768 parameters, roughly 8× fewer.
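The parameter arithmetic above can be checked directly. A minimal sketch in plain Python, using the layer sizes from the example (bias and batch-norm parameters are ignored, as in the text):

```python
# Parameter counts for a 3x3 conv layer with 64 input and 128 output channels.
k, c_in, c_out = 3, 64, 128

standard = k * k * c_in * c_out       # one k x k filter per (input, output) channel pair
depthwise = k * k * c_in              # one k x k filter per input channel
pointwise = c_in * c_out              # 1x1 conv that mixes channels
separable = depthwise + pointwise

print(standard)                        # 73728
print(separable)                       # 8768
print(round(standard / separable, 1))  # 8.4
```

The exact ratio here is about 8.4, which is where the "roughly 8x" figure comes from.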
WHY THIS WORKS
Standard convolution learns spatial patterns (edges, textures) and channel interactions (color combinations) simultaneously. Depthwise separable convolution assumes the two can be factored: learn spatial patterns within each channel first, then combine across channels with 1×1 convolutions. This assumption holds surprisingly well for visual features: the accuracy cost is typically 1-2 percentage points in exchange for roughly 8× fewer operations.
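The operation savings have a clean closed form: because the spatial size H×W multiplies both counts, it cancels, and the ratio of multiply-accumulates reduces to 1/C_out + 1/k². A small sketch with the same 64-in / 128-out, 3×3 example:

```python
# Multiply-accumulate (MAC) counts per output position; H*W cancels in the ratio.
k, c_in, c_out = 3, 64, 128

standard_macs = k * k * c_in * c_out          # standard conv, per output pixel
separable_macs = k * k * c_in + c_in * c_out  # depthwise + pointwise

ratio = separable_macs / standard_macs
print(round(ratio, 3))                   # 0.119

# Same value from the closed form 1/c_out + 1/k^2:
print(round(1 / c_out + 1 / k**2, 3))    # 0.119
```

At 0.119 of the original cost, this is the ~8.4× reduction quoted above; note the savings grow with the number of output channels.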
MOBILENET V1 VS V2
MobileNetV1: Stacks depthwise separable convolutions directly. Simple, but information can be lost when ReLU activations zero out values in narrow layers.
MobileNetV2: Adds inverted residuals with linear bottlenecks. Each block expands channels, applies a depthwise convolution, then projects back down to a narrow bottleneck. The projection is linear (no ReLU) because ReLU discards information in low-dimensional representations. On ImageNet, V2 reaches higher accuracy than V1 at similar latency.
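The expand/depthwise/project structure can be sketched as a parameter count. The expansion factor of 6 is the default from the MobileNetV2 paper; the 64-channel block size here is illustrative, and batch-norm and bias parameters are again ignored:

```python
# One inverted residual block: 1x1 expand -> 3x3 depthwise -> linear 1x1 project.
c, t, k = 64, 6, 3           # input channels, expansion factor, kernel size
expanded = c * t             # 384 intermediate channels

expand_params = c * expanded        # 1x1 conv that widens the representation
depthwise_params = k * k * expanded  # one 3x3 filter per expanded channel
project_params = expanded * c        # linear 1x1 conv back to the bottleneck

total = expand_params + depthwise_params + project_params
print(total)   # 52608
```

Most of the parameters sit in the two 1×1 convolutions; the depthwise step in the wide intermediate space stays cheap, which is what makes the "inverted" (narrow-wide-narrow) shape affordable.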
WIDTH AND RESOLUTION MULTIPLIERS
MobileNet includes two knobs: a width multiplier (α, typically 0.25-1.0) that scales channel counts, and a resolution multiplier (ρ) that scales the input size (from 224 down to 128). Compute cost scales roughly with α² and ρ², so these allow trading accuracy for speed on specific hardware targets.
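The quadratic scaling follows from the cost expression itself: α shrinks both input and output channel counts, and ρ shrinks both spatial dimensions. A minimal sketch (the 0.5 / 160-pixel configuration is just one example point):

```python
# Conv layer cost ~ (rho*H) * (rho*W) * k^2 * (alpha*C_in) * (alpha*C_out),
# so relative to the full model the scaling factor is alpha^2 * rho^2.
def relative_cost(alpha, rho):
    return alpha**2 * rho**2

print(round(relative_cost(1.0, 1.0), 3))        # 1.0  (baseline)
print(round(relative_cost(0.5, 160 / 224), 3))  # 0.128
```

Halving the width and dropping the input from 224 to 160 pixels leaves only about 13% of the original compute, which is why these two multipliers cover such a wide latency range.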