Edge Deployment Failure Modes: Quantization Drift, Thermal Throttling, and NMS Explosions
QUANTIZATION DRIFT
Quantization reduces weights from 32-bit floats to 8-bit integers, saving 4x memory and enabling faster inference. But quantization is not free. Weights get rounded to the nearest representable value, introducing error. For well-behaved models, accuracy drops 0.5-2%. For models with outlier weights or narrow distributions, drops can reach 5-10%. Detection: Compare FP32 and INT8 accuracy on your validation set before deployment. If the gap exceeds 2%, apply quantization-aware training.
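The rounding error described above is easy to see in a minimal sketch of symmetric per-tensor INT8 quantization (a simplified scheme, not any specific framework's implementation). Note how a single outlier weight widens the scale, which coarsens the grid for every other weight:

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: map max |w| to 127.
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.02, -0.5, 0.31, 1.9, -0.07]  # 1.9 is an outlier
q, scale = quantize_int8(weights)
errors = [abs(w, ) if False else abs(w - d) for w, d in zip(weights, dequantize(q, scale))]
print(max(errors))  # worst-case rounding error is bounded by scale / 2
```

With outlier-heavy weight distributions this per-weight error is what accumulates into the 5-10% accuracy drops mentioned above; per-channel scales or quantization-aware training shrink it.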
THERMAL THROTTLING
Mobile devices throttle CPU/GPU frequency when temperature exceeds thresholds (typically 40-45°C for skin temperature). After 30-60 seconds of continuous inference, performance can drop 30-50%. What worked in development (short tests) fails in production (sustained load). Mitigation: Benchmark sustained performance (5+ minute runs), design for throttled state, or add cooling for embedded systems.
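A sustained benchmark can be sketched as below: run inference continuously and report mean latency per time window, so throttling shows up as later windows being slower than the first. The `run_inference` callable and the window lengths are placeholders; on a real device you would pass your model's inference call and run for 5+ minutes:

```python
import time

def benchmark_sustained(run_inference, duration_s=300.0, window_s=30.0):
    """Run inference continuously; return mean latency per window.

    Thermal throttling appears as later windows reporting higher
    latency than the first -- a short burst benchmark misses this.
    """
    windows, latencies = [], []
    start = window_start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        t0 = time.perf_counter()
        run_inference()
        latencies.append(time.perf_counter() - t0)
        if time.perf_counter() - window_start >= window_s:
            windows.append(sum(latencies) / len(latencies))
            latencies, window_start = [], time.perf_counter()
    return windows

# Dummy workload with short windows for illustration only.
windows = benchmark_sustained(lambda: sum(range(10_000)),
                              duration_s=0.2, window_s=0.05)
print(len(windows), "windows measured")
```

Comparing the last window's mean latency against the first gives a concrete throttling ratio to design against.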
NMS EXPLOSIONS
Object detection uses Non-Maximum Suppression (NMS) to remove duplicate detections. NMS runtime is O(n²) where n is the number of raw detections. Normal scenes produce 50-200 detections (fast). Crowded scenes with many small objects can produce 5,000+ detections, causing NMS to spike from 2ms to 200ms+. Mitigation: Limit maximum detections (top-k before NMS), use batched NMS, or switch to NMS-free architectures like DETR.
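The top-k mitigation can be sketched in plain Python (a reference greedy NMS, not a production kernel): sorting by score and truncating to `top_k` before the pairwise loop bounds the O(n²) cost regardless of how crowded the scene is:

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5, top_k=300):
    # Cap candidates BEFORE the O(n^2) loop: a 5,000-detection frame
    # costs the same as a 300-detection frame.
    order = sorted(range(len(boxes)),
                   key=lambda i: scores[i], reverse=True)[:top_k]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]: the two overlapping boxes collapse
```

Batched NMS is the same loop run per class, so class labels can be folded in by offsetting box coordinates per class before calling this.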
MEMORY SPIKES
Intermediate activations can exceed model size by 10-50x for high-resolution inputs. A model using 50MB of weights might need 500MB of peak memory. On constrained devices this causes OOM crashes or forces slower swap-based execution. Mitigation: Profile peak memory (not just weight size) before deployment, reduce input resolution, or process large inputs in tiles.
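A back-of-envelope estimate shows why activations dominate. The sketch below computes the memory for a single intermediate feature map (NCHW layout, batch size 1); the resolution and channel count are illustrative assumptions, not measurements from any particular model:

```python
def activation_bytes(h, w, channels, dtype_bytes=4):
    # Memory for one intermediate feature map (batch size 1).
    return h * w * channels * dtype_bytes

# An early conv layer at full 1080p with 64 channels in FP32:
mb = activation_bytes(1080, 1920, 64) / 2**20
print(f"{mb:.0f} MiB for a single 64-channel FP32 feature map")  # -> 506 MiB
```

One such feature map already exceeds a 50MB weight file by 10x, and a network holds several alive at once, which is where the 10-50x multiplier comes from. Halving the input resolution cuts this by 4x.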