
Synthetic Data Generation for Computer Vision

WHEN TO USE SYNTHETIC DATA

Synthetic data is particularly valuable for: (1) Rare scenarios that are hard to capture naturally (emergency vehicles, unusual weather). (2) Safety-critical applications where failure cases must be tested systematically. (3) Domains where real data is expensive to collect or requires privacy protection (medical imaging, autonomous driving). (4) Generating ground truth labels automatically (exact bounding boxes, depth maps).

GENERATION APPROACHES

Game engine simulation: Render 3D scenes with physics engines. Throughput: 100,000+ high-resolution images per hour on 8 GPUs. Includes free labels (depth, segmentation, bounding boxes).
Generative models: GANs or diffusion models synthesize realistic images. Higher quality but slower and harder to control.
Domain randomization: Vary textures, lighting, and object positions aggressively. The real world becomes "just another variation" the model must handle.
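The domain randomization idea above can be sketched as a per-frame parameter sampler. The parameter names and ranges below are illustrative assumptions, not tied to any specific engine; the point is that every rendered frame draws fresh textures, lighting, and poses so the model never overfits to one appearance:

```python
import random

def sample_randomization(rng):
    """Sample one domain-randomization configuration for a rendered frame.

    Parameter names and ranges are illustrative: the goal is aggressive
    variation so the real world looks like 'just another variation'.
    """
    return {
        "texture_id": rng.randrange(10_000),           # random surface texture
        "light_intensity": rng.uniform(0.2, 3.0),      # brightness multiplier
        "light_azimuth_deg": rng.uniform(0.0, 360.0),  # light direction
        "camera_jitter_m": rng.uniform(0.0, 0.5),      # camera position noise
        "object_yaw_deg": rng.uniform(0.0, 360.0),     # object orientation
    }

# Each training frame gets its own independent configuration:
configs = [sample_randomization(random.Random(seed)) for seed in range(3)]
```

In a real pipeline, each sampled configuration would be passed to the renderer, which also emits the free labels (depth, segmentation, boxes) mentioned above.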

THE DOMAIN GAP PROBLEM

Synthetic images differ from real images in subtle ways: perfect lighting, missing sensor noise, unrealistic textures. Models trained heavily on synthetic data can lose 2-10 percentage points of accuracy on real validation sets. Mitigations: include sensor-accurate noise models (rolling shutter, motion blur, lens distortion), mix synthetic and real data (e.g., a 70-30 split), and anneal the synthetic ratio during training.
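A minimal sketch of the sensor-noise mitigation, using NumPy. This models only two of the effects named above (horizontal motion blur and additive read noise); the blur length and noise level are assumed values, and a production pipeline would calibrate them against the target camera:

```python
import numpy as np

def add_sensor_noise(img, rng, read_noise_std=0.01, blur_len=5):
    """Apply a simplified sensor model to a float grayscale image in [0, 1].

    Two illustrative effects: horizontal motion blur via a box filter,
    then additive Gaussian read noise, clipped back to the valid range.
    """
    # Horizontal motion blur: average each pixel with its row neighbors.
    kernel = np.ones(blur_len) / blur_len
    blurred = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), axis=1, arr=img
    )
    # Additive Gaussian read noise.
    noisy = blurred + rng.normal(0.0, read_noise_std, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = np.full((32, 32), 0.5)   # synthetic render is "too perfect"
degraded = add_sensor_noise(clean, rng)
```

Running every synthetic image through a model like this narrows the gap between the pristine rendered distribution and what a real sensor produces.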

⚠️ Key Trade-off: Synthetic data provides perfect labels and unlimited rare scenarios, but domain gap requires careful mitigation or you hurt real-world performance.

MIXING STRATEGY

Start with 30% synthetic data early in training when the model needs diverse patterns. Anneal to 10% late in training to fine-tune on real data distribution. Monitor validation accuracy on real data throughout.
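The annealing schedule above can be expressed as a batch sampler. The linear schedule and the 30% → 10% endpoints come from the text; the pool/batch interface is a hypothetical sketch:

```python
import random

def synthetic_fraction(epoch, total_epochs, start=0.30, end=0.10):
    """Linearly anneal the synthetic share of each batch from start to end."""
    t = epoch / max(total_epochs - 1, 1)
    return start + t * (end - start)

def sample_batch(real_pool, synth_pool, batch_size, epoch, total_epochs, rng):
    """Draw one mixed batch whose synthetic share follows the annealed fraction."""
    n_synth = round(synthetic_fraction(epoch, total_epochs) * batch_size)
    batch = rng.sample(synth_pool, n_synth) + rng.sample(real_pool, batch_size - n_synth)
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
real = [("real", i) for i in range(1000)]
synth = [("synth", i) for i in range(1000)]
early = sample_batch(real, synth, 100, epoch=0, total_epochs=10, rng=rng)
late = sample_batch(real, synth, 100, epoch=9, total_epochs=10, rng=rng)
```

Early batches carry 30 synthetic samples out of 100; by the final epoch that drops to 10, while real-data validation accuracy is tracked throughout as the text recommends.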

💡 Key Takeaways
- Synthetic data is valuable for rare scenarios, safety testing, and automatic ground truth labeling
- Rendering throughput: 100,000+ images per hour on 8 GPUs with free labels (depth, segmentation, boxes)
- Domain gap: models trained heavily on synthetic data lose 2-10 percentage points on real validation
- Mixing strategy: start with 30% synthetic, anneal to 10% late in training; include sensor-accurate noise models
📌 Interview Tips
1. Explain when synthetic data helps: rare scenarios, safety testing, automatic labels, privacy protection
2. Describe domain gap mitigation: sensor noise modeling, domain randomization, mixed training data
3. Mention the annealing strategy: more synthetic early for diversity, less late for real distribution tuning