Synthetic Data Generation for Computer Vision
WHEN TO USE SYNTHETIC DATA
Synthetic data is particularly valuable for:
(1) Rare scenarios that are hard to capture naturally (emergency vehicles, unusual weather).
(2) Safety-critical applications where failure cases must be tested.
(3) Domains where real data is expensive to collect or requires privacy protection (medical imaging, autonomous driving).
(4) Generating ground-truth labels automatically (exact bounding boxes, depth maps).
GENERATION APPROACHES
Game engine simulation: Render 3D scenes with physics engines. Throughput: 100,000+ high-resolution images per hour on 8 GPUs. Includes free labels (depth, segmentation, bounding boxes).
Generative models: GANs or diffusion models synthesize realistic images. Higher quality but slower and harder to control.
Domain randomization: Vary textures, lighting, and object positions aggressively. The real world becomes "just another variation" the model must handle.
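A minimal sketch of domain randomization: sample aggressive, uncorrelated variations in texture, lighting, and object pose for each rendered scene, so that real-world appearance falls inside the training distribution. The parameter names and ranges here are illustrative assumptions, not tied to any particular engine.

```python
import random

def sample_scene_config(rng: random.Random) -> dict:
    """Sample one randomized scene configuration (hypothetical parameters)."""
    return {
        # Texture: pick any texture for any object; realism is not required.
        "texture_id": rng.randrange(10_000),
        # Lighting: vary intensity and color temperature widely.
        "light_intensity": rng.uniform(0.1, 10.0),
        "color_temperature_k": rng.uniform(2000.0, 12000.0),
        # Object pose: random position and yaw within assumed scene bounds.
        "position_xyz": [rng.uniform(-5.0, 5.0) for _ in range(3)],
        "yaw_deg": rng.uniform(0.0, 360.0),
    }

rng = random.Random(0)
configs = [sample_scene_config(rng) for _ in range(1000)]
```

In a real pipeline, each sampled config would be handed to the renderer, which emits the image plus its free labels.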
THE DOMAIN GAP PROBLEM
Synthetic images differ from real images in subtle ways: perfect lighting, missing sensor noise, unrealistic textures. Models trained heavily on synthetic data can lose 2-10 percentage points of accuracy on real validation sets. Mitigations: include sensor-accurate noise models (rolling shutter, motion blur, lens distortion), mix synthetic and real data (e.g., a 70-30 split), and anneal the synthetic ratio downward during training.
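One of the mitigations above, sensor-style noise, can be sketched in a few lines: apply a small horizontal blur (a crude stand-in for motion blur) followed by additive Gaussian read noise. The image representation (a list of rows of floats in [0, 1]) and the noise parameters are assumptions chosen for illustration, not a calibrated sensor model.

```python
import random

def add_sensor_noise(image, rng, read_noise_std=0.02, blur_taps=3):
    """Apply a horizontal box blur, then clipped Gaussian read noise."""
    h, w = len(image), len(image[0])
    # Horizontal box blur approximates mild motion blur.
    blurred = [
        [
            sum(image[y][max(0, x - k)] for k in range(blur_taps)) / blur_taps
            for x in range(w)
        ]
        for y in range(h)
    ]
    # Additive Gaussian read noise, clipped back into [0, 1].
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, read_noise_std))) for px in row]
        for row in blurred
    ]

rng = random.Random(0)
clean = [[0.5] * 8 for _ in range(4)]  # a flat 4x8 synthetic test frame
noisy = add_sensor_noise(clean, rng)
```

A production pipeline would instead fit the noise model (and lens distortion, rolling-shutter skew) to the target camera.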
MIXING STRATEGY
Start with 30% synthetic data early in training, when the model needs diverse patterns. Anneal to 10% late in training to fine-tune on the real data distribution. Monitor validation accuracy on real data throughout.
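The schedule above can be sketched as a linear anneal of the synthetic fraction from 30% down to 10% over the course of training, with each example's source drawn at random according to the current ratio. The function names are illustrative, not from any library.

```python
import random

def synthetic_ratio(progress: float, start: float = 0.30, end: float = 0.10) -> float:
    """Linearly anneal the synthetic fraction; progress is in [0, 1]."""
    progress = min(1.0, max(0.0, progress))
    return start + (end - start) * progress

def sample_source(progress: float, rng: random.Random) -> str:
    """Pick the data source for one training example at this training progress."""
    return "synthetic" if rng.random() < synthetic_ratio(progress) else "real"

rng = random.Random(0)
early = sum(sample_source(0.0, rng) == "synthetic" for _ in range(10_000))
late = sum(sample_source(1.0, rng) == "synthetic" for _ in range(10_000))
```

A step schedule (e.g., dropping the ratio at fixed epochs) works equally well; the important part is tracking real-data validation accuracy as the mix shifts.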