
Critical Failure Modes in Hardware-Aware Optimization

Benchmark vs Production Gap

The most common failure mode: the model meets its benchmarks but fails in production. Causes: benchmarks run on an isolated GPU while production shares resources; benchmarks use fixed batch sizes while production sees variable load; thermal throttling under sustained load cuts real throughput by 20-40%. Prevention: benchmark under realistic conditions, including concurrent workloads, memory pressure from other processes, and sustained operation for 10+ minutes to surface thermal effects.
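A minimal sketch of sustained benchmarking: instead of one short timing run, measure throughput in fixed windows over a long run so throttling shows up as later windows reporting lower throughput than the first. `run_inference` is a hypothetical placeholder for one real inference batch; the durations are illustrative, not recommendations.

```python
import time

def sustained_benchmark(run_inference, warmup_s=30, window_s=60, total_s=600):
    """Run `run_inference` repeatedly and report throughput per window.

    A declining series of per-window throughputs indicates thermal
    throttling or resource contention that a short benchmark misses.
    """
    # Warm up first so caches, clocks, and allocator reach steady state.
    end_warmup = time.monotonic() + warmup_s
    while time.monotonic() < end_warmup:
        run_inference()

    windows = []
    deadline = time.monotonic() + total_s
    while time.monotonic() < deadline:
        count, window_end = 0, time.monotonic() + window_s
        while time.monotonic() < window_end:
            run_inference()
            count += 1
        windows.append(count / window_s)  # batches per second in this window
    return windows
```

Comparing `windows[0]` against `windows[-1]` quantifies the sustained-load gap directly, rather than trusting a single cold-start number.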

Memory Fragmentation Crashes

Model fits in memory during testing but crashes after hours of production use. Dynamic shapes create variable-sized allocations that fragment memory. Over time, total free memory is sufficient but no contiguous block is large enough. Symptoms: sporadic OOM errors despite memory appearing available. Fixes: use fixed tensor sizes where possible, preallocate buffers, restart workers periodically (crude but effective), use memory-pooling allocators.
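One way to apply the "fixed tensor sizes" fix is shape bucketing: pad variable-length inputs up to a small set of fixed sizes so the allocator sees a handful of allocation sizes instead of an unbounded variety. A minimal sketch; the bucket sizes here are illustrative, not tuned values.

```python
import bisect

# Hypothetical bucket sizes: every request is padded up to one of these,
# so buffers of only five distinct sizes ever get allocated and freed,
# which keeps freed blocks reusable instead of fragmenting memory.
BUCKETS = [64, 128, 256, 512, 1024]

def bucketed_length(seq_len, buckets=BUCKETS):
    """Round seq_len up to the nearest bucket; reject oversize inputs."""
    i = bisect.bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket")
    return buckets[i]
```

For example, `bucketed_length(100)` pads a length-100 sequence up to the 128 bucket; the wasted padding is the price paid for predictable allocation sizes.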

Operator Divergence Across Hardware

Same model produces different outputs on different hardware due to: different floating-point rounding (CPU vs GPU, Intel vs AMD); operator implementation differences in frameworks; fused kernels computing differently than unfused. Symptoms: accuracy varies by 0.5-2% across deployment targets; edge cases fail on some hardware but not others. Prevention: test on actual target hardware, not just similar hardware; use deterministic mode during validation; set explicit numerical tolerances per hardware target.
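Setting explicit per-target tolerances can be as simple as a validation check that compares each target's outputs against a reference and fails when the worst element-wise difference exceeds that target's bound. A sketch under assumed names; the target names and tolerance values are illustrative.

```python
# Hypothetical per-target bounds: looser for hardware with more
# aggressive fused/low-precision kernels.
TOLERANCES = {"x86-cpu": 1e-6, "cuda-gpu": 1e-4, "edge-npu": 1e-2}

def outputs_within_tolerance(ref, out, target, tolerances=TOLERANCES):
    """Compare flat output lists element-wise against a per-target bound.

    Returns (passed, worst_abs_diff) so validation logs can record how
    close each target runs to its allowed divergence.
    """
    tol = tolerances[target]
    worst = max(abs(a - b) for a, b in zip(ref, out))
    return worst <= tol, worst
```

The same output that passes under the GPU tolerance can fail under the CPU tolerance, which is exactly the cross-hardware divergence being guarded against.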

⚠️ Critical: Accuracy drops on edge devices are often caused by quantization calibration on server hardware. Calibrate on representative edge-like inputs or actual edge hardware when possible.
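The calibration dependency is easy to see in a minimal symmetric int8 scheme: the quantization scale is derived from the activation ranges observed in the calibration samples, so calibrating on a server-side input distribution bakes in the wrong clipping range for edge inputs. A simplified sketch, not any framework's actual calibration API.

```python
def calibrate_scale(samples):
    """Derive a symmetric int8 scale from observed activation ranges.

    `samples` is an iterable of batches of activation values. If these
    come from a server distribution while production sees edge inputs,
    the resulting scale clips or wastes range, costing accuracy.
    """
    max_abs = max(abs(x) for batch in samples for x in batch)
    return max_abs / 127.0  # symmetric int8 range [-127, 127]

def quantize(x, scale):
    """Quantize one value to int8 with saturation at the range edges."""
    q = round(x / scale)
    return max(-127, min(127, q))
```

Any production value larger than the calibrated `max_abs` saturates at 127, which is why the calibration inputs must cover the deployment distribution.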
💡 Key Takeaways
- Benchmark vs production gap: thermal throttling reduces real throughput 20-40% on sustained load
- Memory fragmentation: dynamic shapes cause OOM after hours despite sufficient total memory
- Fixes for fragmentation: fixed tensor sizes, preallocated buffers, periodic worker restarts
- Operator divergence causes 0.5-2% accuracy variance across hardware targets
- Calibrate quantization on edge-representative inputs, not just server hardware
📌 Interview Tips
1. Describe thermal throttling causing a 20-40% throughput drop on sustained load - a production-specific insight
2. Explain memory fragmentation from dynamic shapes and the periodic-restart workaround
3. Mention operator divergence across hardware (Intel vs AMD, CPU vs GPU) to show numerical precision awareness