Edge Deployment Failure Modes: Quantization Drift, Thermal Throttling, and NMS Explosions
QUANTIZATION DRIFT
Quantization reduces weights from 32-bit floats to 8-bit integers, saving 4x memory and enabling faster inference. But quantization is not free. Weights get rounded to the nearest representable value, introducing error. For well-behaved models, accuracy drops 0.5-2%. For models with outlier weights or narrow distributions, drops can reach 5-10%. Detection: Compare FP32 and INT8 accuracy on your validation set before deployment. If the gap exceeds 2%, apply quantization-aware training.
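The rounding error described above is easy to see in a minimal sketch of symmetric per-tensor INT8 quantization (a simplified scheme, not any specific framework's implementation). Note how a single outlier weight widens the scale, which coarsens the grid for every other weight:

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: map max |w| to 127.
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.02, -0.5, 0.31, 1.9, -0.07]  # 1.9 is an outlier
q, scale = quantize_int8(weights)
errors = [abs(w, ) if False else abs(w - d) for w, d in zip(weights, dequantize(q, scale))]
print(max(errors))  # worst-case rounding error is bounded by scale / 2
```

With outlier-heavy weight distributions this per-weight error is what accumulates into the 5-10% accuracy drops mentioned above; per-channel scales or quantization-aware training shrink it.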
THERMAL THROTTLING
Mobile devices throttle CPU/GPU frequency when temperature exceeds thresholds (typically 40-45°C for skin temperature). After 30-60 seconds of continuous inference, performance can drop 30-50%. What worked in development (short tests) fails in production (sustained load). Mitigation: Benchmark sustained performance (5+ minute runs), design for throttled state, or add cooling for embedded systems.
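A sustained benchmark can be sketched as below: run inference continuously and report mean latency per time window, so throttling shows up as later windows being slower than the first. The `run_inference` callable and the window lengths are placeholders; on a real device you would pass your model's inference call and run for 5+ minutes:

```python
import time

def benchmark_sustained(run_inference, duration_s=300.0, window_s=30.0):
    """Run inference continuously; return mean latency per window.

    Thermal throttling appears as later windows reporting higher
    latency than the first -- a short burst benchmark misses this.
    """
    windows, latencies = [], []
    start = window_start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        t0 = time.perf_counter()
        run_inference()
        latencies.append(time.perf_counter() - t0)
        if time.perf_counter() - window_start >= window_s:
            windows.append(sum(latencies) / len(latencies))
            latencies, window_start = [], time.perf_counter()
    return windows

# Dummy workload with short windows for illustration only.
windows = benchmark_sustained(lambda: sum(range(10_000)),
                              duration_s=0.2, window_s=0.05)
print(len(windows), "windows measured")
```

Comparing the last window's mean latency against the first gives a concrete throttling ratio to design against.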
NMS EXPLOSIONS
Object detection uses Non-Maximum Suppression (NMS) to remove duplicate detections. NMS runtime is O(n²) where n is the number of raw detections. Normal scenes produce 50-200 detections (fast). Crowded scenes with many small objects can produce 5,000+ detections, causing NMS to spike from 2ms to 200ms+. Mitigation: Limit maximum detections (top-k before NMS), use batched NMS, or switch to NMS-free architectures like DETR.
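The top-k mitigation can be sketched in plain Python (a reference greedy NMS, not a production kernel): sorting by score and truncating to `top_k` before the pairwise loop bounds the O(n²) cost regardless of how crowded the scene is:

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5, top_k=300):
    # Cap candidates BEFORE the O(n^2) loop: a 5,000-detection frame
    # costs the same as a 300-detection frame.
    order = sorted(range(len(boxes)),
                   key=lambda i: scores[i], reverse=True)[:top_k]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]: the two overlapping boxes collapse
```

Batched NMS is the same loop run per class, so class labels can be folded in by offsetting box coordinates per class before calling this.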
MEMORY SPIKES
Intermediate activations can exceed model size by 10-50x for high-resolution inputs. A model using 50MB of weights might need 500MB of peak memory. On constrained devices this causes OOM crashes or forces slower swap-based execution. Mitigation: Profile peak memory (not just weight size) before deployment, reduce input resolution, or process large inputs in tiles.
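A back-of-envelope estimate shows why activations dominate. The sketch below computes the memory for a single intermediate feature map (NCHW layout, batch size 1); the resolution and channel count are illustrative assumptions, not measurements from any particular model:

```python
def activation_bytes(h, w, channels, dtype_bytes=4):
    # Memory for one intermediate feature map (batch size 1).
    return h * w * channels * dtype_bytes

# An early conv layer at full 1080p with 64 channels in FP32:
mb = activation_bytes(1080, 1920, 64) / 2**20
print(f"{mb:.0f} MiB for a single 64-channel FP32 feature map")  # -> 506 MiB
```

One such feature map already exceeds a 50MB weight file by 10x, and a network holds several alive at once, which is where the 10-50x multiplier comes from. Halving the input resolution cuts this by 4x.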