What is Unsupervised Anomaly Detection?
Why Unsupervised
Labeled anomalies are expensive or impossible to obtain. Fraud detection has labels (chargebacks), but manufacturing defect detection, network intrusion detection, and novel attack identification often lack labeled examples. You cannot label what you have never seen before.
Even when labels exist, they may be delayed (chargebacks take 30-90 days) or incomplete (only caught fraud gets labeled). Unsupervised methods can start scoring as soon as enough unlabeled data is available, without waiting for label collection.
The Core Assumption
All unsupervised anomaly detection rests on one assumption: anomalies are rare and different. If 99% of your data follows certain patterns, the 1% that differs is treated as anomalous. This breaks when anomalies are common (the training data is so contaminated that the model learns them as normal) or when normal data has high variance (everything looks different, so nothing stands out).
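The effect of contamination can be illustrated with a minimal sketch. It uses a plain z-score as the anomaly score, which is an assumption made for illustration only and not a method described above; the data values are hypothetical. The same extreme value (50) is scored once when it makes up 1% of the data and once when it makes up 30%.

```python
import statistics

def z_score(value, data):
    """How many standard deviations `value` lies from the mean of `data`."""
    return abs(value - statistics.fmean(data)) / statistics.pstdev(data)

def make_normals(n):
    # Hypothetical "normal" readings clustered around 10 (values 9.0 to 11.0).
    return [10 + (i % 5 - 2) * 0.5 for i in range(n)]

ANOMALY = 50.0

# Case 1: anomalies are rare (1 in 100).
rare = make_normals(99) + [ANOMALY]
# Case 2: anomalies are common (30 in 100), i.e. heavily contaminated data.
common = make_normals(70) + [ANOMALY] * 30

z_rare = z_score(ANOMALY, rare)
z_common = z_score(ANOMALY, common)

print(f"z-score when rare:   {z_rare:.2f}")
print(f"z-score when common: {z_common:.2f}")
```

When the anomalies are common, they pull the mean toward themselves and inflate the standard deviation, so the same value no longer looks unusual under a typical 3-sigma cutoff.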
Two Main Approaches
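Distance-based: Anomalies are far from normal points. Compute distance to nearest neighbors or cluster centers. LOF (Local Outlier Factor) falls here: it compares a point's local density, estimated from nearest-neighbor distances, with the density around its neighbors. Isolation Forest is often grouped with these methods, but it computes no distances: it scores a point by how few random splits are needed to isolate it from the rest of the data.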
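The distance-based idea can be sketched with a plain k-nearest-neighbor score. This is a simplified illustration, not LOF or Isolation Forest: each point is scored by the distance to its k-th nearest neighbor, and the function name `knn_scores`, the choice `k=3`, and the synthetic data are all assumptions made for the example.

```python
import math
import random

def knn_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbor.

    A larger score means the point sits farther from its neighbors.
    Brute force, O(n^2): fine for a sketch, too slow for large data.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

rng = random.Random(0)
# Hypothetical data: 100 "normal" points near the origin plus one distant point.
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
outlier = (10.0, 10.0)
points = normal + [outlier]

scores = knn_scores(points, k=3)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print("highest-scoring point:", points[most_anomalous])
```

In practice a threshold on the score (for example, a high percentile) decides which points are flagged. LOF refines this idea by comparing each point's neighbor distances with those of its neighbors, which helps when clusters have different densities.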
Reconstruction-based: Train a model to compress and reconstruct normal data. Anomalies reconstruct poorly because the model never learned their patterns. Autoencoders are the primary example.
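The compress-and-reconstruct idea can be sketched without a neural network. The example below swaps in a linear stand-in for an autoencoder: it compresses 2-D points to a single number by projecting onto the top principal direction (found by power iteration), then expands back and measures the squared error. The function names, the synthetic data, and the two test points are assumptions made for illustration.

```python
import random

def fit_linear_codec(rows, iters=100):
    """Fit a 1-D linear bottleneck: the data mean and top principal direction."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Covariance matrix (d x d) of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the direction of greatest variance.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

def reconstruction_error(row, mean, v):
    """Encode to one number, decode back, and return the squared error."""
    c = [x - m for x, m in zip(row, mean)]
    code = sum(ci * vi for ci, vi in zip(c, v))   # encode (compress)
    recon = [code * vi for vi in v]               # decode (reconstruct)
    return sum((ci - ri) ** 2 for ci, ri in zip(c, recon))

rng = random.Random(0)
# Hypothetical normal data: points lying close to the line y = 2x.
train = []
for _ in range(200):
    x = rng.uniform(-3, 3)
    train.append((x, 2 * x + rng.gauss(0, 0.1)))

mean, v = fit_linear_codec(train)

normal_err = reconstruction_error((1.0, 2.0), mean, v)    # follows the pattern
anomaly_err = reconstruction_error((1.0, -2.0), mean, v)  # violates the pattern
print(f"normal error:  {normal_err:.4f}")
print(f"anomaly error: {anomaly_err:.4f}")
```

The point that follows the learned pattern reconstructs almost exactly, while the point that violates it loses most of its information in the bottleneck. An autoencoder applies the same principle with nonlinear encoder and decoder networks, which lets it capture patterns that a single straight line cannot.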