
Production Hardware-Aware Optimization: Edge vs. Cloud Trade-Offs

In production systems, hardware-aware optimization targets differ dramatically between edge and cloud. Edge devices have strict power budgets of 2 to 5 watts and thermal limits that trigger throttling; a common target is 30 to 60 frames per second at under 5 watts of total board power. On a smartphone, on-device speech or vision has a 10 to 20 millisecond budget per stage and cannot trigger thermal throttling that degrades the user experience. A model that runs at 12 milliseconds when cold may degrade to 20 to 25 milliseconds once the System on Chip (SoC) hits its thermal limits, and battery voltage droop events force frequency drops that further increase latency. Apple addresses this by aligning model blocks to ANE-friendly operations and using quantization and sparsity to sustain 30 frames per second in camera pipelines while staying within a few watts.

Cloud inference, by contrast, prioritizes throughput and tail latency under high queries-per-second (QPS) load. Teams use AWS Inferentia or NVIDIA GPUs with auto-tuned compilers. For a Bidirectional Encoder Representations from Transformers (BERT) style encoder serving at p99 under 20 milliseconds, INT8 quantization and layer fusion can increase tokens per second by 1.5 to 3x over FP16 on the same instance, while cutting cost per million tokens by 30 to 60 percent. NVIDIA reports 1.5 to 2x throughput gains on A100-class GPUs with calibrated INT8 when calibration aligns with operator coverage and activation ranges.

The key trade-off is portability versus efficiency. Tight coupling to a specific accelerator yields 1.5 to 4x throughput gains and meaningful power reduction, but increases maintenance cost when hardware changes. Hardware-aware quantization recovers accuracy lost to naive INT8, but the training loop becomes more complex and slower due to hardware-in-the-loop passes. Adaptive compute reduces average latency by 40 to 70 percent, yet introduces data-dependent variability that hurts tail latency and complicates capacity planning.
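Why calibration must align with activation ranges can be illustrated with a minimal symmetric INT8 quantizer. This is a pure-Python sketch, not a production quantization kernel; the activation values and calibration ranges below are made up for illustration:

```python
def int8_quantize(values, calib_max):
    """Symmetric INT8 quantization: map [-calib_max, calib_max] to [-127, 127]."""
    scale = calib_max / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

# Calibration that matches the true activation range keeps rounding error
# near scale/2 per element.
acts = [0.5, -1.2, 3.9, -3.7]
q, s = int8_quantize(acts, calib_max=4.0)
err_good = max(abs(a - d) for a, d in zip(acts, dequantize(q, s)))

# A miscalibrated (10x too wide) range wastes most of the 256 INT8 levels,
# so the same activations round much more coarsely.
q2, s2 = int8_quantize(acts, calib_max=40.0)
err_bad = max(abs(a - d) for a, d in zip(acts, dequantize(q2, s2)))
```

With the well-matched range the worst-case error stays around 0.015, while the 10x-too-wide range inflates it roughly tenfold; this is why calibration quality, not just the INT8 datatype, determines whether accuracy survives.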
Use hardware-aware optimization when you have strict SLOs, high scale (thousands of QPS), or tight power budgets. Prefer generic optimizations when the workload is small, hardware refreshes frequently, or portability across vendors is a hard requirement.
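As a back-of-the-envelope check on the cost claim, cost per million tokens on a fixed instance falls in direct proportion to the throughput gain. The hourly price and token rates below are hypothetical placeholders, not vendor quotes:

```python
def cost_per_million_tokens(instance_cost_per_hour, tokens_per_second):
    """Serving cost per 1M tokens for a single instance at steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return instance_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical: same instance, INT8 + fusion doubles throughput vs FP16 (a 2x
# gain sits inside the 1.5 to 3x range cited above).
fp16_cost = cost_per_million_tokens(4.0, 2000)   # FP16 baseline
int8_cost = cost_per_million_tokens(4.0, 4000)   # INT8 + layer fusion
savings = 1 - int8_cost / fp16_cost              # 0.5, i.e. a 50% cost cut
```

A 2x throughput gain maps to a 50 percent cost cut, which is consistent with the 30 to 60 percent range quoted for 1.5 to 3x gains.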
💡 Key Takeaways
Edge devices operate at 2 to 5 watts with 10 to 20 millisecond per-stage budgets. Thermal throttling can degrade latency from 12 milliseconds cold to 20 to 25 milliseconds hot on the same SoC.
Cloud BERT serving with INT8 and fusion increases tokens per second by 1.5 to 3x on the same instance and cuts cost per million tokens by 30 to 60 percent versus an FP16 baseline.
Tight hardware coupling yields 1.5 to 4x throughput gains but trades away portability. Hardware-aware quantization recovers accuracy but adds training complexity through hardware-in-the-loop passes.
Adaptive compute saves 40 to 70 percent of average compute but introduces data-dependent variability that hurts tail latency. Latency-sensitive services need caps to protect p99 SLOs.
Use hardware-aware optimization for strict SLOs, high scale (thousands of QPS), or tight power budgets. Use generic optimization when hardware refreshes frequently or vendor portability is required.
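One way to see how a cap protects p99 is a toy early-exit depth policy: depth scales with input difficulty, but a hard cap bounds the worst-case work. The layer counts, per-layer latency, and uniform difficulty distribution here are all invented for the sketch:

```python
import random

LAYER_MS = 0.8  # assumed fixed per-layer latency, in milliseconds

def adaptive_depth(difficulty, base_layers=4, full_layers=24, layer_cap=16):
    """Early-exit-style policy: easy inputs run few layers; hard inputs would
    run the full stack, but layer_cap bounds the worst case."""
    wanted = base_layers + round(difficulty * (full_layers - base_layers))
    return min(wanted, layer_cap)

random.seed(0)
latencies = sorted(adaptive_depth(random.random()) * LAYER_MS
                   for _ in range(10_000))
p50 = latencies[len(latencies) // 2]
p99 = latencies[int(0.99 * len(latencies))]
# Without the cap, the hardest inputs would run 24 layers (19.2 ms); the cap
# pins the worst case (and hence p99) at 16 layers = 12.8 ms, while the
# median stays lower because easy inputs still exit early.
```

The data-dependent spread between p50 and p99 is exactly the capacity-planning headache noted above; the cap converts an unbounded tail into a known ceiling, at the cost of truncating compute on the hardest inputs.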
📌 Examples
Apple ANE sustains a 30 fps camera pipeline at a few watts by using quantization, sparsity, and aligning blocks to accelerator-friendly operations
AWS Inferentia serving BERT at p99 under 20 ms uses INT8 quantization and auto-tuned compilation to increase throughput 1.5 to 3x while cutting cost by 30 to 60%
A smartphone speech model has a 10 to 20 ms per-stage budget and must avoid thermal throttling that would degrade the user experience during sustained use
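A first-order model reproduces the throttling numbers above: treat a compute-bound stage's latency as inversely proportional to clock frequency. The frequencies are hypothetical, and real SoCs also drop memory and accelerator clocks, so this is only a sketch of the mechanism:

```python
def throttled_latency_ms(cold_ms, cold_ghz, hot_ghz):
    """Compute-bound latency scales roughly with 1 / clock frequency."""
    return cold_ms * cold_ghz / hot_ghz

# A 12 ms cold-run stage lands in the 20 to 25 ms range once the SoC
# throttles from a nominal 2.0 GHz down to around 1.1 GHz.
hot_ms = throttled_latency_ms(12.0, 2.0, 1.1)
```

This is why the per-stage budget must be met at the *throttled* clock, not the cold one: a model that fits a 20 ms budget only at peak frequency will blow it during sustained use.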
← Back to Hardware-Aware Optimization Overview