Handling Class Imbalance and Long Tail Labels
Why Standard Training Fails
Standard loss functions treat all examples equally. If 95% of training data is "billing," the model achieves 95% accuracy by always predicting billing. The 5% rare cases get ignored because missing them barely hurts the loss. For a security breach detector, this 5% is exactly what you need to catch.
Class Weighting
The simplest fix: weight each class inversely to its frequency. If "billing" has 10,000 examples and "security" has 50, give security examples 200x higher weight in the loss function. Now misclassifying one security example hurts as much as misclassifying 200 billing examples.
Set weight as: class_weight[i] = total_samples / (num_classes × class_count[i])
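The formula above can be sketched in a few lines of plain Python with NumPy. The function name and the toy 10,000/50 split are illustrative, not from any particular library:

```python
import numpy as np

def balanced_class_weights(labels):
    """Per-class weights inversely proportional to frequency:
    class_weight[i] = total_samples / (num_classes * class_count[i])."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical distribution: 10,000 "billing" examples, 50 "security"
labels = ["billing"] * 10_000 + ["security"] * 50
weights = balanced_class_weights(labels)
# The security weight is 200x the billing weight, matching the
# 200:1 frequency ratio between the two classes.
```

These weights plug directly into most frameworks, e.g. the `class_weight` argument of scikit-learn estimators or the `weight` tensor of a cross-entropy loss.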
The Long Tail Problem
With 500+ categories, the tail gets extreme. Top 10 categories cover 80% of traffic. The remaining 490 share 20%, averaging 0.04% each. Even with class weighting, models struggle when a category has fewer than 50 training examples.
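The tail arithmetic above checks out; a quick sketch with the section's own numbers (500 categories, top 10 covering 80%):

```python
num_categories = 500
num_head = 10          # top categories covering 80% of traffic
top_share = 0.80

num_tail = num_categories - num_head      # 490 tail categories
tail_avg = (1 - top_share) / num_tail     # share per tail category
# tail_avg is about 0.0004, i.e. roughly 0.04% of traffic each
```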
Alternative: hierarchical classification. First classify into roughly 20 broad categories, each with enough training examples, then use a category-specific model for the fine-grained labels within each.