Critical Failure Modes in Hardware-Aware Optimization
Benchmark vs Production Gap
The most common failure mode: a model meets its benchmarks but fails in production. Causes: benchmarks run on an isolated GPU while production shares resources; benchmarks use fixed batch sizes while production sees variable load; thermal throttling under sustained load cuts real throughput by 20-40%. Prevention: benchmark under realistic conditions, including concurrent workloads, memory pressure from other processes, and sustained operation for 10+ minutes to trigger thermal effects.
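A minimal sketch of sustained-load benchmarking: instead of one short timed run, measure throughput per time window so that degradation over time (e.g. from thermal throttling) shows up as a drop between early and late windows. The harness and the `dummy_inference` workload are illustrative, not from the text; in practice the workload would be a real model forward pass and `total_s` would be 600+ seconds, not the toy duration used here.

```python
import time


def benchmark_sustained(fn, window_s=1.0, total_s=5.0):
    """Run fn repeatedly for total_s seconds; return per-window throughput.

    Comparing the first and last windows exposes degradation that a
    one-shot benchmark on an idle, cool machine will hide.
    """
    windows = []
    start = time.perf_counter()
    win_start, count = start, 0
    while True:
        fn()
        count += 1
        now = time.perf_counter()
        if now - win_start >= window_s:
            windows.append(count / (now - win_start))
            win_start, count = now, 0
        if now - start >= total_s:
            break
    return windows


def dummy_inference():
    # Placeholder CPU-bound workload; swap in a real model call.
    sum(i * i for i in range(10_000))


throughput = benchmark_sustained(dummy_inference, window_s=0.5, total_s=2.0)
drop = 1 - throughput[-1] / throughput[0]
print(f"first: {throughput[0]:.0f}/s, last: {throughput[-1]:.0f}/s, drop: {drop:.1%}")
```

To approximate production conditions, run the same harness while a concurrent process generates load and memory pressure, and compare the window curves.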
Memory Fragmentation Crashes
The model fits in memory during testing but crashes after hours of production use. Dynamic shapes create variable-sized allocations that fragment memory: over time, total free memory is sufficient but no contiguous block is large enough. Symptoms: sporadic OOM errors even though memory appears available. Fixes: use fixed tensor sizes where possible, preallocate buffers, restart workers periodically (crude but effective), or use memory-pooling allocators.
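The preallocation/pooling fix can be sketched as a bucketed buffer pool: requests are rounded up to a small set of fixed bucket sizes and freed buffers are reused, so the allocator only ever sees a few fixed sizes instead of an open-ended stream of variable ones. This is an illustrative host-memory sketch using `bytearray`; the class name and bucket sizes are assumptions, and the same idea applies to GPU tensors (e.g. preallocated framework buffers or a framework's own pooling allocator).

```python
class BucketedBufferPool:
    """Round allocation sizes up to fixed buckets and reuse freed buffers.

    Reusing a few fixed sizes avoids the variable-sized allocations that
    fragment memory in long-running processes.
    """

    def __init__(self, bucket_sizes):
        self.bucket_sizes = sorted(bucket_sizes)
        self.free = {size: [] for size in self.bucket_sizes}

    def acquire(self, nbytes):
        # Find the smallest bucket that fits the request.
        for size in self.bucket_sizes:
            if nbytes <= size:
                pool = self.free[size]
                return pool.pop() if pool else bytearray(size)
        raise ValueError(f"request of {nbytes} bytes exceeds largest bucket")

    def release(self, buf):
        # len(buf) is always an exact bucket size, so reuse is safe.
        self.free[len(buf)].append(buf)


pool = BucketedBufferPool([1 << 10, 1 << 12, 1 << 14])
a = pool.acquire(900)    # served by a fresh 1 KiB buffer
pool.release(a)
b = pool.acquire(1000)   # fits the same bucket, so the buffer is reused
print(len(b), b is a)    # 1024 True
```

The trade-off is padding waste (a 900-byte request occupies 1 KiB), which is usually cheaper than fragmentation-driven OOMs.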
Operator Divergence Across Hardware
The same model produces different outputs on different hardware due to: different floating-point rounding (CPU vs GPU, Intel vs AMD); operator implementation differences between frameworks; fused kernels computing differently than their unfused equivalents. Symptoms: accuracy varies 0.5-2% across deployment targets; edge cases fail on some hardware but not others. Prevention: test on the actual target hardware, not just similar hardware; use deterministic mode during validation; set explicit numerical tolerances per hardware target.
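A sketch of the per-target tolerance check, under stated assumptions: `TOLERANCES`, the target names, and the `validate_outputs` helper are all hypothetical illustrations, not a real API. The summation demo stands in for kernel-level differences: reordering or fusing floating-point operations changes rounding, so two mathematically identical reductions can disagree.

```python
import math

# Illustrative per-target relative tolerances (assumed values, not
# vendor-recommended numbers); looser for lower-precision targets.
TOLERANCES = {"cpu_f64": 1e-12, "gpu_f32": 1e-5}


def validate_outputs(reference, candidate, target):
    """Return (pass, worst_relative_error) for candidate vs reference."""
    tol = TOLERANCES[target]
    worst = max(
        abs(r - c) / max(abs(r), 1e-30)
        for r, c in zip(reference, candidate)
    )
    return worst <= tol, worst


# Demo: the same sum computed in two orders diverges badly under
# catastrophic cancellation, mimicking fused-vs-unfused kernel drift.
xs = [1.0, 1e100, 1.0, -1e100]
reference = [math.fsum(xs)]   # correctly rounded sum: 2.0
candidate = [sum(xs)]         # naive left-to-right sum: 0.0
ok, worst = validate_outputs(reference, candidate, "gpu_f32")
print(ok, worst)              # False 1.0
```

In a real pipeline, `reference` would come from a trusted target (often CPU in deterministic mode, e.g. PyTorch's `torch.use_deterministic_algorithms(True)`), and each deployment target would be validated against it with its own tolerance.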