
Edge Deployment Failure Modes: Quantization Drift, Thermal Throttling, and NMS Explosions

QUANTIZATION DRIFT

Quantization reduces weights from 32-bit floats to 8-bit integers, saving 4x memory and enabling faster inference. But quantization is not free. Weights get rounded to the nearest representable value, introducing error. For well-behaved models, accuracy drops 0.5-2%. For models with outlier weights or narrow distributions, drops can reach 5-10%. Detection: Compare FP32 and INT8 accuracy on your validation set before deployment. If the gap exceeds 2%, apply quantization-aware training.
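The effect of outlier weights on rounding error can be demonstrated with a minimal sketch of symmetric per-tensor INT8 quantization (the scheme, weight distributions, and function names here are illustrative assumptions, not a specific framework's implementation):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: scale by max |w|, round, clip to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# Well-behaved weights: tight Gaussian distribution
w_good = rng.normal(0, 0.05, 10_000).astype(np.float32)
# Outlier-heavy weights: a few large values stretch the scale,
# squeezing most weights into only a handful of INT8 bins
w_bad = w_good.copy()
w_bad[:5] = 3.0

for name, w in [("well-behaved", w_good), ("outliers", w_bad)]:
    q, s = quantize_int8(w)
    err = np.abs(dequantize(q, s) - w).mean()
    print(f"{name}: mean abs rounding error = {err:.6f}")
```

The outlier tensor's rounding error is an order of magnitude larger, which is why per-channel quantization or quantization-aware training is needed for such models.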

THERMAL THROTTLING

Mobile devices throttle CPU/GPU frequency when temperature exceeds thresholds (typically 40-45°C for skin temperature). After 30-60 seconds of continuous inference, performance can drop 30-50%. What worked in development (short tests) fails in production (sustained load). Mitigation: Benchmark sustained performance (5+ minute runs), design for throttled state, or add cooling for embedded systems.

⚠️ Key Trade-off: Peak performance benchmarks are misleading. Design for sustained (thermally throttled) performance, which can be 30-50% lower.
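A sustained benchmark can be sketched as a windowed throughput loop; the function and parameter names below are hypothetical, and `infer` stands in for whatever model call you are measuring:

```python
import time

def sustained_benchmark(infer, duration_s=300.0, window_s=10.0):
    """Run `infer` repeatedly for `duration_s` seconds and report
    throughput per window. A falling curve across windows exposes
    thermal throttling that a short burst benchmark would miss."""
    windows = []
    start = time.perf_counter()
    count, win_start = 0, start
    while time.perf_counter() - start < duration_s:
        infer()
        count += 1
        now = time.perf_counter()
        if now - win_start >= window_s:
            windows.append(count / (now - win_start))  # inferences/sec
            count, win_start = 0, now
    return windows

# Hypothetical usage with a real model call:
# rates = sustained_benchmark(lambda: model(input_tensor))
# print(f"peak {max(rates):.1f}/s, sustained {rates[-1]:.1f}/s")
```

Comparing the first window (cold device) against the last (thermally saturated) gives the realistic performance number to design against.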

NMS EXPLOSIONS

Object detection uses Non-Maximum Suppression (NMS) to remove duplicate detections. NMS runtime is O(n²) where n is the number of raw detections. Normal scenes produce 50-200 detections (fast). Crowded scenes with many small objects can produce 5,000+ detections, causing NMS to spike from 2ms to 200ms+. Mitigation: Limit maximum detections (top-k before NMS), use batched NMS, or switch to NMS-free architectures like DETR.
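A minimal sketch of greedy NMS with the top-k pre-filter described above (NumPy-based, with illustrative function names; real deployments would use a framework's batched NMS op):

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box [x1, y1, x2, y2] against an (n, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5, top_k=300):
    """Greedy NMS with a top-k pre-filter: keeping only the top_k
    highest-scoring boxes bounds the O(n^2) suppression loop, so a
    crowded scene with 5,000 raw detections cannot blow up latency."""
    order = np.argsort(scores)[::-1][:top_k]   # top-k cap BEFORE NMS
    boxes, scores = boxes[order], scores[order]
    keep = []
    idx = np.arange(len(boxes))
    while idx.size:
        i = idx[0]
        keep.append(int(order[i]))
        idx = idx[1:][iou(boxes[i], boxes[idx[1:]]) <= iou_thresh]
    return keep  # indices into the original boxes array
```

With the cap in place, worst-case NMS cost is O(top_k²) regardless of how many raw detections the head emits.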

MEMORY SPIKES

Intermediate activations can exceed model size by 10-50x for high-resolution inputs. A model using 50MB weights might need 500MB peak memory. On constrained devices, this causes OOM crashes or forces slower swap-based execution.
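The arithmetic behind the spike is worth making concrete. A rough sketch (the helper name and layer shapes are illustrative assumptions):

```python
def feature_map_mb(h, w, c, bytes_per_elem=4):
    """Memory for one intermediate activation tensor (FP32 by default)."""
    return h * w * c * bytes_per_elem / 2**20

# A model with 50 MB of weights can still need hundreds of MB of
# activation memory at high resolution: a single 64-channel FP32
# feature map at full 1080p resolution costs
print(f"{feature_map_mb(1080, 1920, 64):.0f} MB")  # ~506 MB
```

This is why edge pipelines downscale inputs aggressively or run early layers in reduced precision: activation memory scales with input resolution, independent of model size.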

💡 Key Takeaways
- Quantization drift: well-behaved models lose 0.5-2%, outlier weights can lose 5-10%; compare FP32 vs INT8 before deployment
- Thermal throttling: 30-50% performance drop after 30-60 seconds sustained load; design for throttled state
- NMS explosions: O(n²) runtime, 50 detections = 2ms, 5000 detections = 200ms+; use top-k limiting
- Memory spikes: activations can be 10-50x model size at high resolution; can cause OOM crashes
📌 Interview Tips
1. Explain quantization drift: compare FP32 and INT8 accuracy, apply quantization-aware training if the gap exceeds 2%
2. Describe thermal throttling: short benchmarks are misleading, test 5+ minute sustained runs
3. Mention NMS as a latency trap: O(n²) means crowded scenes explode; limit max detections before NMS