SMOTE: Synthetic Minority Oversampling Technique
How SMOTE Works
SMOTE creates synthetic minority class examples by interpolating between existing minority samples. For each minority example, find its k nearest minority neighbors, randomly select one, and create a new point on the line segment between them. This expands the minority class decision region without exact duplication.
The Algorithm
1. For each minority class sample, find k nearest neighbors (typically k=5) within the minority class.
2. Randomly select one neighbor.
3. Create synthetic sample: new_point = original + random(0,1) × (neighbor - original).
4. Repeat until desired balance ratio is achieved.
The synthetic points fill the feature space between real minority examples.
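The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`k_nearest`, `smote`), the `seed` parameter, the choice to cycle through minority samples in order, and the toy 2-D points are all assumptions made for the example.

```python
import math
import random


def k_nearest(index, points, k):
    """Return indices of the k points closest (Euclidean) to points[index]."""
    dists = [
        (math.dist(points[index], p), j)
        for j, p in enumerate(points)
        if j != index
    ]
    dists.sort()
    return [j for _, j in dists[:k]]


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbors than other points
    # Step 1: k nearest neighbors of each sample, within the minority class only.
    neighbors = [k_nearest(i, minority, k) for i in range(len(minority))]
    synthetic = []
    for n in range(n_synthetic):          # Step 4: repeat until enough samples
        i = n % len(minority)             # cycle through the minority samples
        j = rng.choice(neighbors[i])      # Step 2: pick one neighbor at random
        gap = rng.random()                # uniform draw in [0, 1)
        original, neighbor = minority[i], minority[j]
        # Step 3: new_point = original + gap * (neighbor - original)
        synthetic.append(
            tuple(o + gap * (nb - o) for o, nb in zip(original, neighbor))
        )
    return synthetic


# Hypothetical minority samples in a 2-D feature space.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5), (3.0, 1.5), (1.2, 1.8)]
new_points = smote(minority, n_synthetic=10, k=3)
```

Every generated point lies on a segment between two real minority samples, so it stays inside the bounding box of the minority data. In practice a vectorized library implementation would replace the brute-force neighbor search used here.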
Why Interpolation Works
Simple oversampling (duplicating minority examples) tends to cause overfitting: the model can memorize the exact minority points. SMOTE generates novel points that are plausibly minority class, which pushes the model to learn a broader decision region rather than memorize specific examples. This rests on the assumption that the feature space between neighboring minority samples also belongs to the minority class.
Warning: SMOTE assumes linear interpolation in feature space produces valid samples. For complex feature distributions (images, text embeddings), this assumption often breaks: synthetic samples may be unrealistic or fall inside majority class regions.
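A small constructed case shows how interpolation can land in majority territory. The coordinates below are hypothetical and chosen to make the failure obvious: two minority points sit on either side of a majority cluster, and the helper `interpolate` is an illustrative name, not part of any library.

```python
import math


def interpolate(a, b, gap):
    """Point at fraction `gap` along the segment from a to b."""
    return tuple(x + gap * (y - x) for x, y in zip(a, b))


# Hypothetical 2-D data: two minority points separated by a majority cluster.
minority_left = (0.0, 0.0)
minority_right = (4.0, 0.0)
majority = [(2.0, 0.0), (2.0, 0.5), (2.0, -0.5)]

# Interpolating halfway between the two minority points gives the midpoint.
synthetic = interpolate(minority_left, minority_right, 0.5)

nearest_minority = min(
    math.dist(synthetic, m) for m in (minority_left, minority_right)
)
nearest_majority = min(math.dist(synthetic, m) for m in majority)

# True here: the synthetic "minority" point sits on top of a majority sample.
lands_in_majority_region = nearest_majority < nearest_minority
```

Restricting interpolation to the k nearest minority neighbors reduces this risk but does not remove it when minority samples are sparse or split into separate clusters.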
SMOTE Variants
Borderline-SMOTE oversamples only the minority samples near the decision boundary, which are the hardest to classify. ADASYN generates more synthetic samples for minority points that have more majority class neighbors, that is, in regions where the minority class is harder to learn. SMOTE-NC handles mixed numerical-categorical features. Choose a variant based on the characteristics of the data.
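To make "near the decision boundary" concrete, the sketch below sorts minority samples into safe, danger, and noise groups, following the usual description of Borderline-SMOTE: a sample is in danger when at least half, but not all, of its m nearest neighbors (taken over both classes) are majority. The function name `classify_minority`, the parameter `m`, and the toy data are assumptions for illustration only.

```python
import math


def classify_minority(X, y, minority_label, m=5):
    """Label each minority sample 'safe', 'danger', or 'noise'."""
    labels = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # m nearest neighbors among ALL samples (both classes), excluding itself.
        nearest = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:m]
        n_majority = sum(1 for j in nearest if y[j] != minority_label)
        if n_majority == len(nearest):
            labels[i] = "noise"    # surrounded entirely by the majority class
        elif 2 * n_majority >= len(nearest):
            labels[i] = "danger"   # borderline: these get oversampled
        else:
            labels[i] = "safe"     # well inside the minority region
    return labels


# Toy data on a line: 12 majority points at x = -8..3, 6 minority points.
majority = [(float(x), 0.0) for x in range(-8, 4)]
minority = [(0.5, 0.0), (2.6, 0.0), (4.0, 0.0), (4.3, 0.0), (4.6, 0.0), (4.9, 0.0)]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)

groups = classify_minority(X, y, minority_label=1, m=3)
```

In this toy set the point at x = 0.5 is isolated among majority samples (noise), the point at x = 2.6 has two majority neighbors out of three (danger), and the cluster from x = 4.0 to 4.9 is safe. Only the danger samples would serve as seeds for interpolation.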
Practical Application
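Apply SMOTE only to training data, never to validation or test sets. When cross-validating, oversample inside each training fold so that no synthetic point derived from a held-out sample leaks into training. Evaluate on the natural class distribution. Typical target: 1:1 to 1:3 minority-to-majority ratio, not necessarily perfect balance. Excessive oversampling can introduce noise that hurts generalization.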
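As a rough aid for choosing how much to oversample, the helper below computes how many synthetic minority samples are needed to reach a target ratio. It is a sketch: `synthetic_needed` is a hypothetical name, and the class counts are made-up numbers standing in for the training split only (the test split is left untouched).

```python
def synthetic_needed(n_minority, n_majority, target=(1, 1)):
    """Synthetic minority samples required to reach a minority:majority target."""
    minority_part, majority_part = target
    # Smallest minority count whose ratio to n_majority meets the target,
    # computed with integer ceiling division to avoid float rounding.
    wanted = -(-n_majority * minority_part // majority_part)
    return max(0, wanted - n_minority)


# Hypothetical training split: 40 minority and 760 majority samples.
train_minority, train_majority = 40, 760

to_reach_1_to_3 = synthetic_needed(train_minority, train_majority, target=(1, 3))
to_reach_1_to_1 = synthetic_needed(train_minority, train_majority, target=(1, 1))
```

With these counts, a 1:3 target needs 214 synthetic samples while full balance needs 720, which illustrates how much more synthetic data perfect balance injects into the training set.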