Unsupervised Anomaly Detection (Isolation Forest, Autoencoders)

What is Unsupervised Anomaly Detection?

Definition
Unsupervised anomaly detection identifies unusual data points without labeled examples of anomalies. The model learns what "normal" looks like from unlabeled data, then flags anything that deviates significantly from that normal pattern.

Why Unsupervised

Labeled anomalies are expensive or impossible to obtain. Fraud detection has labels (chargebacks), but manufacturing defect detection, network intrusion detection, and novel attack identification often lack labeled examples. You cannot label what you have never seen before.

Even when labels exist, they may be delayed (chargebacks take 30-90 days) or incomplete (only caught fraud gets labeled). Unsupervised methods detect anomalies from day one without waiting for label collection.

The Core Assumption

All unsupervised anomaly detection rests on one assumption: anomalies are rare and different. If 99% of your data follows certain patterns, the 1% that differs is anomalous. This breaks when anomalies are common (contaminated training data) or when normal data has high variance (everything looks different).

⚠️ Key Limitation: Unsupervised methods find statistical outliers, not necessarily harmful anomalies. A legitimate user with unusual behavior gets flagged alongside actual fraud. Human review or downstream rules must separate true threats from false alarms.

Two Main Approaches

Distance-based: Anomalies lie far from normal points. Compute the distance to nearest neighbors or cluster centers and flag points that are isolated. Isolation Forest (which separates anomalies with fewer random splits) and LOF (Local Outlier Factor) fall into this category.
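A minimal sketch of the distance-based approach with scikit-learn's `IsolationForest`, using synthetic data (the cluster location, anomaly range, and `contamination=0.02` are illustrative assumptions, not values from this article):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# "Normal" transactions: a tight cluster around the origin
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
# A few planted anomalies far from the cluster
anomalies = rng.uniform(low=8.0, high=10.0, size=(10, 2))
X = np.vstack([normal, anomalies])

# contamination = expected anomaly fraction; we must guess it up front
clf = IsolationForest(contamination=0.02, random_state=42)
labels = clf.fit_predict(X)  # -1 = anomaly, 1 = normal

# Most of the planted anomalies should be flagged
print((labels[-10:] == -1).sum())
```

Note that `contamination` encodes the core assumption directly: you tell the model roughly how rare anomalies are, and it sets the decision threshold accordingly.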

Reconstruction-based: Train a model to compress and reconstruct normal data. Anomalies reconstruct poorly because the model never learned their patterns. Autoencoders are the primary example.
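A minimal sketch of the reconstruction-based idea, using scikit-learn's `MLPRegressor` trained to reproduce its own input through a 2-unit bottleneck (a stand-in for a proper autoencoder; the layer size, threshold percentile, and synthetic data are illustrative assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
# "Normal" data the model learns to compress and reconstruct
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))

# 2-unit hidden layer = bottleneck; the net must learn a compressed code
ae = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                  max_iter=2000, random_state=0)
ae.fit(normal, normal)  # target = input, i.e. reconstruction

def reconstruction_error(X):
    # Mean squared error per sample between input and reconstruction
    return np.mean((ae.predict(X) - X) ** 2, axis=1)

# Threshold: 99th percentile of error on normal data (an assumed cutoff)
threshold = np.percentile(reconstruction_error(normal), 99)

# A point unlike anything seen in training reconstructs poorly
outlier = np.array([[8.0, 8.0, 8.0, 8.0]])
print(reconstruction_error(outlier)[0] > threshold)
```

The threshold choice is where precision/recall trade-offs live: a lower percentile flags more points, which is exactly where the human review mentioned above becomes necessary.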

💡 Key Takeaways
- Unsupervised detection learns "normal" from unlabeled data and flags significant deviations
- Use when labels are unavailable, delayed (30-90 days), or incomplete
- Core assumption: anomalies are rare and different; breaks with contaminated data
- Distance-based: anomalies far from normal (Isolation Forest, LOF)
- Reconstruction-based: anomalies reconstruct poorly (Autoencoders)
📌 Interview Tips
1. Explain when to use unsupervised: no labels, delayed labels, or novel unknown anomalies
2. Mention the core assumption: anomalies must be rare and different from normal
3. Distinguish distance-based vs reconstruction-based approaches