Tiered Architecture for Latency and Cost Optimization
Why Tiers Work
Not all classification requests are equally difficult. "I want to cancel my subscription" is obviously a cancellation request. But "I am thinking about whether this makes sense for my budget" is ambiguous. A simple model classifies the first correctly in 2ms; the second needs a sophisticated model that takes 150ms.
The key insight: 70-80% of real production traffic consists of easy cases. If you can identify and route them to cheap models, you cut average latency and cost dramatically while maintaining accuracy on the hard cases.
Three-Tier Architecture
Tier 1 (Rule-based, 0.5ms): Keyword matching and regex patterns. "Cancel subscription" triggers the cancellation label. Handles 30-40% of traffic with 99% precision on matched patterns. No ML cost.
Tier 2 (Lightweight ML, 5ms): Distilled BERT or logistic regression on TF-IDF. Handles 40-50% of traffic. 85-90% accuracy. Runs on CPU.
Tier 3 (Full model, 100-200ms): Large language model or fine-tuned transformer. Handles remaining 10-20% of ambiguous cases. 92-95% accuracy. Requires GPU.
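Tier 1 can be sketched as a small table of regex rules. The patterns and labels below are illustrative placeholders, not a rule set from the source; a production rule table would be larger and curated against real traffic.

```python
import re

# Hypothetical Tier 1 rule table: compiled pattern -> label.
# These two rules are examples only.
TIER1_RULES = [
    (re.compile(r"\bcancel (my )?subscription\b", re.IGNORECASE), "cancellation"),
    (re.compile(r"\breset (my )?password\b", re.IGNORECASE), "password_reset"),
]

def tier1_classify(text):
    """Return (label, confidence) on a rule match, else None.

    Matched patterns are treated as near-certain, reflecting the
    ~99% precision claimed for Tier 1.
    """
    for pattern, label in TIER1_RULES:
        if pattern.search(text):
            return label, 0.99
    return None  # no match: escalate to Tier 2
```

Returning None (rather than a low-confidence guess) keeps the escalation decision explicit: anything Tier 1 cannot match with high precision falls through to the ML tiers.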
Routing Logic
Each tier outputs a confidence score. If Tier 1 matches with confidence above 0.95, return immediately. Otherwise, pass to Tier 2. If Tier 2 confidence is below 0.80, escalate to Tier 3. These thresholds are tuned on validation data to balance accuracy against cost.
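The cascade above can be sketched as a single routing function. The tier classifiers are passed in as stand-in callables returning (label, confidence), and the 0.95 / 0.80 thresholds mirror the values given in the text; everything else here is an assumed interface, not a prescribed one.

```python
def route(text, tier1, tier2, tier3,
          t1_threshold=0.95, t2_threshold=0.80):
    """Cascade from cheapest tier to most expensive, escalating on low confidence.

    tier1 returns (label, confidence) or None; tier2 and tier3 always
    return (label, confidence). Returns (label, tier_name).
    """
    # Tier 1: rules. Accept only a high-confidence match.
    result = tier1(text)
    if result is not None and result[1] >= t1_threshold:
        return result[0], "tier1"

    # Tier 2: lightweight ML. Accept unless confidence is low.
    label, conf = tier2(text)
    if conf >= t2_threshold:
        return label, "tier2"

    # Tier 3: full model handles the remaining ambiguous cases.
    label, _ = tier3(text)
    return label, "tier3"
```

A usage sketch with stub classifiers: `route("cancel my subscription", t1, t2, t3)` returns from Tier 1, while an ambiguous budget question falls through both thresholds and lands on Tier 3.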