How MobileNet Achieves 8x Faster Inference with Depthwise Separable Convolutions
THE DEPTHWISE SEPARABLE TRICK
Standard convolution applies a 3×3 filter across all input channels simultaneously. For 64 input channels and 128 output channels, this requires 3×3×64×128 = 73,728 parameters per layer. Depthwise separable convolution splits this into two steps: (1) apply a separate 3×3 filter to each input channel (64×9 = 576 parameters), (2) use a 1×1 convolution to combine channels (64×128 = 8,192 parameters). Total: 8,768 parameters, roughly 8× fewer.
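The parameter arithmetic above can be checked directly. A minimal sketch in plain Python, using the layer sizes from the example (bias and batch-norm parameters are ignored, as in the text):

```python
# Parameter counts for a 3x3 conv layer with 64 input and 128 output channels.
k, c_in, c_out = 3, 64, 128

standard = k * k * c_in * c_out       # one k x k filter per (input, output) channel pair
depthwise = k * k * c_in              # one k x k filter per input channel
pointwise = c_in * c_out              # 1x1 conv that mixes channels
separable = depthwise + pointwise

print(standard)                        # 73728
print(separable)                       # 8768
print(round(standard / separable, 1))  # 8.4
```

The exact ratio here is about 8.4, which is where the "roughly 8x" figure comes from.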
WHY THIS WORKS
Standard convolution learns spatial patterns (edges, textures) and channel interactions (color combinations) simultaneously. Depthwise separable convolution assumes the two can be factored: learn spatial patterns within each channel first, then combine across channels with 1×1 convolutions. This assumption holds surprisingly well for visual features: the accuracy cost is typically 1-2 percentage points in exchange for roughly 8× fewer operations.
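The operation savings have a clean closed form: because the spatial size H×W multiplies both counts, it cancels, and the ratio of multiply-accumulates reduces to 1/C_out + 1/k². A small sketch with the same 64-in / 128-out, 3×3 example:

```python
# Multiply-accumulate (MAC) counts per output position; H*W cancels in the ratio.
k, c_in, c_out = 3, 64, 128

standard_macs = k * k * c_in * c_out          # standard conv, per output pixel
separable_macs = k * k * c_in + c_in * c_out  # depthwise + pointwise

ratio = separable_macs / standard_macs
print(round(ratio, 3))                   # 0.119

# Same value from the closed form 1/c_out + 1/k^2:
print(round(1 / c_out + 1 / k**2, 3))    # 0.119
```

At 0.119 of the original cost, this is the ~8.4× reduction quoted above; note the savings grow with the number of output channels.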
MOBILENET V1 VS V2
MobileNetV1: Stacks depthwise separable convolutions directly. Simple, but information can be lost when ReLU activations zero out values in narrow layers.
MobileNetV2: Adds inverted residuals with linear bottlenecks. Each block expands channels, applies a depthwise convolution, then projects back down to a narrow bottleneck. The projection is linear (no ReLU) because ReLU discards information in low-dimensional representations. On ImageNet, V2 reaches higher accuracy than V1 at similar latency.
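The expand/depthwise/project structure can be sketched as a parameter count. The expansion factor of 6 is the default from the MobileNetV2 paper; the 64-channel block size here is illustrative, and batch-norm and bias parameters are again ignored:

```python
# One inverted residual block: 1x1 expand -> 3x3 depthwise -> linear 1x1 project.
c, t, k = 64, 6, 3           # input channels, expansion factor, kernel size
expanded = c * t             # 384 intermediate channels

expand_params = c * expanded        # 1x1 conv that widens the representation
depthwise_params = k * k * expanded  # one 3x3 filter per expanded channel
project_params = expanded * c        # linear 1x1 conv back to the bottleneck

total = expand_params + depthwise_params + project_params
print(total)   # 52608
```

Most of the parameters sit in the two 1×1 convolutions; the depthwise step in the wide intermediate space stays cheap, which is what makes the "inverted" (narrow-wide-narrow) shape affordable.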
WIDTH AND RESOLUTION MULTIPLIERS
MobileNet includes two knobs: a width multiplier (α, typically 0.25-1.0) that scales channel counts, and a resolution multiplier (ρ) that scales the input size (from 224 down to 128). Compute cost scales roughly with α² and ρ², so these allow trading accuracy for speed on specific hardware targets.
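The quadratic scaling follows from the cost expression itself: α shrinks both input and output channel counts, and ρ shrinks both spatial dimensions. A minimal sketch (the 0.5 / 160-pixel configuration is just one example point):

```python
# Conv layer cost ~ (rho*H) * (rho*W) * k^2 * (alpha*C_in) * (alpha*C_out),
# so relative to the full model the scaling factor is alpha^2 * rho^2.
def relative_cost(alpha, rho):
    return alpha**2 * rho**2

print(round(relative_cost(1.0, 1.0), 3))        # 1.0  (baseline)
print(round(relative_cost(0.5, 160 / 224), 3))  # 0.128
```

Halving the width and dropping the input from 224 to 160 pixels leaves only about 13% of the original compute, which is why these two multipliers cover such a wide latency range.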