
Critical Failure Modes in Hardware-Aware Optimization

Quantization misalignment is the most common production failure. Generic Quantization-Aware Training (QAT) assumes a simplified noise model with symmetric quantization and uniform rounding. If the target NPU uses different rounding modes, per-tensor versus per-channel scaling, or asymmetric saturation at accumulation stages, production accuracy can drop several percentage points even though lab tests passed. Hardware-in-the-loop QAT closes this gap by measuring the true INT8-versus-FP32 noise on the device and injecting it during training. Without this alignment, models that achieve a 0.5 percent accuracy delta in simulation can see 2 to 3 percent drops in production.

Operator coverage gaps cause silent performance cliffs. If a single unsupported operation forces a fallback to the Central Processing Unit (CPU), or to FP32 on the GPU, in the middle of a fused block, latency can spike by 2 to 10x. This is easy to miss when profiling relies on microbenchmarks of individual layers rather than the full graph with real data flow.

FLOPs illusions are another pitfall. Models that appear cheaper by FLOPs can be slower in practice due to poor cache reuse, dynamic shapes that defeat fusion, or bandwidth-bound upsampling and concatenation patterns. A model with 8 billion FLOPs but scattered memory access can be slower than a 12 billion FLOP model with sequential access.

Dynamic or expert routing violates tail SLOs when some inputs activate more experts or longer token paths, inflating p99 latency by 50 to 100 percent. You need admission control, maximum-experts-per-token limits, or routing caps.

Calibration drift hurts quantized models over time. Activation ranges measured on a friendly calibration dataset do not cover the outliers seen in production, so rare extreme values saturate and cause error spikes. For example, a recommendation model calibrated on typical user interactions may saturate on bot traffic or power users with unusual behavior patterns, causing precision to drop 5 to 10 percentage points for those segments. Periodic recalibration every few weeks and robust outlier clipping are needed to maintain quality.
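To make the first point concrete, here is a minimal sketch of hardware-in-the-loop noise injection, assuming you have already run the layer in INT8 on the target NPU and recorded per-channel output error statistics offline. The module name and the measured_bias / measured_std inputs are illustrative, not a specific vendor API.

```python
# Minimal sketch of hardware-in-the-loop QAT noise injection (PyTorch).
# Assumes `measured_bias` / `measured_std` come from an offline pass that ran
# the same layer in INT8 on the target NPU and recorded the per-channel
# (INT8 - FP32) output error. Names are illustrative, not a real API.
import torch
import torch.nn as nn

class DeviceNoiseInjector(nn.Module):
    """Adds the quantization error observed on the real device to a layer's
    FP32 output during training, so the network learns to tolerate the
    target NPU's actual rounding and saturation behaviour."""

    def __init__(self, measured_bias: torch.Tensor, measured_std: torch.Tensor):
        super().__init__()
        # Per-channel error statistics measured on hardware, shape (C,).
        self.register_buffer("bias", measured_bias)
        self.register_buffer("std", measured_std)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return x  # at inference the real quantizer supplies the error
        # Broadcast the (C,) statistics over an NCHW activation.
        noise = self.bias.view(1, -1, 1, 1) + \
                self.std.view(1, -1, 1, 1) * torch.randn_like(x)
        # Detach so gradients flow through x only (straight-through style).
        return x + noise.detach()

# Usage: wrap the layers whose quantized output diverges most on device.
conv = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    DeviceNoiseInjector(torch.zeros(16), torch.full((16,), 0.02)),
)
out = conv(torch.randn(8, 3, 32, 32))
```

To illustrate why per-layer microbenchmarks miss fallback and fusion cliffs, the next sketch times the full graph with real data flow and compares it against the sum of isolated layer timings; a large gap is a hint to look for fallbacks, broken fusion, or bandwidth-bound glue ops. The toy model and the 1.5x threshold are placeholders, not a recommended setting.

```python
# Sketch: compare end-to-end latency with the sum of per-layer microbenchmarks.
# A large gap usually means CPU/FP32 fallbacks, broken fusion, or memory-bound
# glue ops that layer-by-layer profiling cannot see.
import time
import torch
import torch.nn as nn

def time_fn(fn, warmup=10, iters=100):
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2), nn.Conv2d(32, 8, 3, padding=1),
).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    # End-to-end latency with the real data flow through the whole graph.
    e2e = time_fn(lambda: model(x))

    # Per-layer microbenchmarks: each layer timed in isolation on its own input.
    per_layer, inp = 0.0, x
    for layer in model:
        out = layer(inp)
        per_layer += time_fn(lambda l=layer, i=inp: l(i))
        inp = out

ratio = e2e / per_layer
print(f"end-to-end {e2e*1e3:.2f} ms vs layer-sum {per_layer*1e3:.2f} ms (x{ratio:.2f})")
if ratio > 1.5:  # arbitrary threshold; tune per platform
    print("gap suggests fallbacks, broken fusion, or bandwidth-bound ops")
```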
💡 Key Takeaways
Generic QAT with simplified noise models can pass lab tests with a 0.5 percent delta but drop 2 to 3 percent in production when the real device uses different rounding, scaling, or saturation.
A single unsupported operator forces a fallback to CPU or FP32, spiking latency 2 to 10x. This is missed by microbenchmarks that profile layers individually without full graph data flow.
FLOPs illusions occur when a model with fewer FLOPs has poor cache reuse or scattered memory access, making it slower than a higher FLOP model with sequential patterns. Real measurements are mandatory.
Expert routing and adaptive compute can inflate p99 latency by 50 to 100 percent when some inputs activate more experts. Requires maximum experts per token caps and admission control.
Calibration drift causes quantized models to saturate on outliers not in calibration set, dropping precision 5 to 10 percentage points on specific segments. Needs periodic recalibration every few weeks.
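A minimal sketch of the recalibration idea, assuming symmetric INT8 quantization and a 99.9th-percentile clip over a fresh production sample; the helper names and the synthetic traffic are illustrative only.

```python
# Sketch: periodic recalibration of activation ranges with outlier clipping.
# Instead of min/max over a friendly calibration set, use percentiles over a
# recent production sample so rare extreme values do not set the scale.
import numpy as np

def calibrate_scale(activations: np.ndarray, pct: float = 99.9) -> float:
    """Return a symmetric INT8 scale from a clipped activation range."""
    hi = np.percentile(np.abs(activations), pct)   # ignore the extreme tail
    return float(hi) / 127.0                       # symmetric int8 range

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale  # dequantized, for measuring the quantization error

# Example: ranges drift when production traffic includes heavy outliers.
rng = np.random.default_rng(0)
calib = rng.normal(0.0, 1.0, 100_000)                    # friendly calibration set
prod = np.concatenate([rng.normal(0.0, 1.0, 99_000),
                       rng.normal(0.0, 20.0, 1_000)])    # bots / power users

naive_scale = np.abs(calib).max() / 127.0   # min/max on stale calibration data
robust_scale = calibrate_scale(prod)        # percentile on fresh production sample

for name, s in [("stale min/max", naive_scale), ("refreshed 99.9th pct", robust_scale)]:
    err = np.mean((prod - quantize(prod, s)) ** 2)
    print(f"{name:22s} scale={s:.4f}  MSE on production traffic={err:.4f}")
```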
📌 Examples
Recommendation model calibrated on typical users saturates on bot traffic or power users, causing 5 to 10 percentage point precision drop for those segments
BERT serving with one unsupported op in fused attention block falls back to CPU, spiking latency from 18ms to 180ms at p99
Transformer with expert routing shows 12ms average but 24ms p99 when some inputs activate 8 experts instead of typical 2, violating SLO
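A minimal sketch of a routing cap, assuming a simple top-k gate in PyTorch; the CappedRouter name and dimensions are illustrative, and admission control (rejecting or rerouting over-budget requests) is not shown.

```python
# Sketch: capping experts per token in a mixture-of-experts router (PyTorch).
# A hard max_experts_per_token bound keeps worst-case compute, and therefore
# tail latency, predictable. Dimensions and the gating rule are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CappedRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int, max_experts_per_token: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.k = max_experts_per_token  # hard cap, independent of gate confidence

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model) -> route each token to at most k experts.
        logits = self.gate(x)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)  # never more than k
        weights = F.softmax(topk_vals, dim=-1)             # renormalise over chosen experts
        return topk_idx, weights

router = CappedRouter(d_model=64, num_experts=8, max_experts_per_token=2)
idx, w = router(torch.randn(16, 64))
print(idx.shape, w.shape)  # (16, 2), (16, 2): bounded fan-out per token
```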