
SMOTE: Synthetic Minority Oversampling Technique

How SMOTE Works: SMOTE creates synthetic minority class examples by interpolating between existing minority samples. For each minority example, find its k nearest minority neighbors, randomly select one, and create a new point on the line segment between them. This expands the minority class decision region without exact duplication.

The Algorithm

1. For each minority class sample, find its k nearest neighbors (typically k=5) within the minority class.
2. Randomly select one of those neighbors.
3. Create a synthetic sample: new_point = original + random(0,1) × (neighbor - original).
4. Repeat until the desired balance ratio is achieved.

The synthetic points fill the feature space between real minority examples.
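The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `smote_sketch` and its parameters are hypothetical, it uses brute-force Euclidean distances, and it picks base samples at random (with replacement) rather than looping over every minority sample a fixed number of times. It assumes at least two minority samples.

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # a sample cannot have more neighbors than other samples

    # Step 1: k nearest neighbors, searched within the minority class only.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # shape (n, k)

    # Step 2: pick a base sample, then one of its k neighbors at random.
    base = rng.integers(0, n, size=n_synthetic)
    pick = rng.integers(0, k, size=n_synthetic)
    nbr = neighbors[base, pick]

    # Step 3: new_point = original + random(0,1) * (neighbor - original)
    gap = rng.random((n_synthetic, 1))
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# Usage: four minority points on the corners of the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_minority, n_synthetic=10, k=2)
```

Because every synthetic point lies on a segment between two real minority points, the output always stays inside the convex hull of the minority samples.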

Why Interpolation Works

Simple oversampling (duplicating minority examples) causes overfitting: the model memorizes exact minority points. SMOTE generates novel points that are plausibly minority class, pushing the model to learn a broader decision region rather than memorize specific examples. This rests on the assumption that the feature space between neighboring minority samples also belongs to the minority class.

Warning: SMOTE assumes linear interpolation in feature space produces valid samples. For complex feature distributions (images, text embeddings), this assumption breaks. Synthetic samples may be unrealistic or cross into majority class regions.

SMOTE Variants

Borderline-SMOTE oversamples only the minority samples near the decision boundary, where misclassification is most likely. ADASYN adaptively generates more synthetic samples around minority points that are harder to learn, measured by how many majority-class neighbors they have. SMOTE-NC handles mixed numerical-categorical features. Choose a variant based on the characteristics of your data.
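To make "near the decision boundary" concrete, here is a rough sketch of the selection step as Borderline-SMOTE is commonly described: a minority sample counts as "in danger" when at least half, but not all, of its k nearest neighbors (searched over the whole dataset) belong to the majority class. Samples whose neighbors are all majority are treated as noise and skipped. The function name `danger_mask` and its arguments are hypothetical, and the brute-force distance computation is only suitable for small data.

```python
import numpy as np

def danger_mask(X, y, minority_label=1, k=5):
    """Boolean mask of minority samples that sit near the class boundary."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # k nearest neighbors over ALL samples (both classes), excluding self.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)
    nn = np.argsort(dists, axis=1)[:, :k]

    # Count majority-class neighbors for every sample.
    n_majority = (y[nn] != minority_label).sum(axis=1)

    # Danger: k/2 <= majority neighbors < k. All-majority means likely noise.
    is_minority = y == minority_label
    return is_minority & (n_majority >= k / 2) & (n_majority < k)

# Usage: 1-D toy data; the minority point at 0.7 sits next to the majority.
X_toy = np.array([[0.0], [0.1], [0.2], [0.7], [0.9], [1.0], [1.5], [1.8]])
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])
mask = danger_mask(X_toy, y_toy, minority_label=1, k=3)
```

Only the samples flagged by the mask would then be used as base points for interpolation.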

Practical Application

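Apply SMOTE only to training data, never to validation or test sets. With cross-validation, resample inside each training fold so that synthetic points derived from held-out samples cannot leak into training. Evaluate on the natural class distribution. Typical target: 1:1 to 1:3 minority-to-majority ratio, not necessarily perfect balance. Excessive oversampling can introduce noise that hurts generalization.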
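The target ratio translates directly into a number of synthetic samples to generate. The helper below is a small arithmetic sketch; `synthetic_needed` is a hypothetical name, and the counts should come from the training split only.

```python
import math

def synthetic_needed(n_minority, n_majority, target_ratio):
    """Synthetic minority samples required to reach target_ratio.

    target_ratio is minority / majority after resampling,
    e.g. 1.0 for 1:1 or 0.5 for 1:2.
    """
    target_minority = math.ceil(n_majority * target_ratio)
    # Never return a negative count if the data already meets the target.
    return max(0, target_minority - n_minority)

# Usage: a training split with 100 fraud and 9,900 legitimate transactions.
to_1_to_2 = synthetic_needed(100, 9900, target_ratio=0.5)  # 4950 - 100 = 4850
to_1_to_1 = synthetic_needed(100, 9900, target_ratio=1.0)  # 9900 - 100 = 9800
```

Going from 1:2 to 1:1 here roughly doubles the number of synthetic points, all interpolated from the same 100 real examples, which is one reason full balance is not always worth it.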

💡 Key Takeaways
SMOTE interpolates between minority neighbors to create synthetic samples, avoiding overfitting from exact duplication
Assumes linear interpolation produces valid samples—breaks for complex features like images or text embeddings
Apply SMOTE only to training data; target 1:1 to 1:3 ratio, not necessarily perfect balance
📌 Interview Tips
1. Algorithm: find k neighbors, select one, create point = original + random(0,1) × (neighbor - original)
2. Use Borderline-SMOTE to oversample minority samples near the decision boundary, ADASYN to generate more samples around hard-to-learn minority points