
Tiered Architecture for Latency and Cost Optimization

A tiered architecture uses a cheap gate to filter out easy cases, routing uncertain items to heavier expert models. This pattern dramatically reduces average latency and serving cost while maintaining high accuracy. The gate is typically a fast model, such as term frequency-inverse document frequency (TF-IDF) features with logistic regression, that runs on CPU in 10 microseconds to 1 millisecond. It handles 70 to 90 percent of traffic with high-confidence predictions, passing only the remaining 10 to 30 percent of uncertain cases to an expensive transformer-based expert running on GPU.

Consider Gmail-scale spam classification at millions of emails per minute. The lexical gate uses TF-IDF features with keyword rules to reject obvious spam in under 1 millisecond of CPU time per email, catching bulk spam via sender-reputation signals and known patterns. The remaining 10 to 30 percent goes to a distilled BERT encoder running on GPU with dynamic batching. A single modern GPU can embed 1,000 to 3,000 short emails per second at sequence lengths around 128 tokens with batch sizes of 32 to 128. Dynamic batching adds 2 to 8 milliseconds of queuing delay but lifts throughput by 2 to 5 times. The classifier head runs on CPU in under 1 millisecond. This keeps end-to-end median latency under 20 milliseconds and P99 under 60 milliseconds.

The key trade-off is accuracy versus cost. If the gate is too aggressive, false negatives slip through and overall recall drops. If the gate is too conservative, it routes too much traffic to the expensive expert and serving costs balloon. Tune gate thresholds on a validation set to hit a target expert routing rate, typically 10 to 30 percent, while maintaining acceptable recall. Monitor per-tier metrics separately: gate precision, gate recall, expert precision, expert recall, and blended system metrics.

Dynamic batching is critical for GPU utilization. Without batching, a GPU processes one item at a time and sits idle between requests, wasting 80 to 95 percent of its compute. Dynamic batching collects requests over a small time window, 2 to 10 milliseconds, then processes them together. Batch sizes of 32 to 128 achieve near-peak throughput. The trade-off is tail latency: items arriving at the start of the window wait longer. Apply a timeout-based flush to prevent excessive P99 spikes under low traffic, and separate latency-critical tenants onto dedicated instances.

For offline batch workloads like product categorization, the priority shifts from latency to throughput and cost per million items. A batch pipeline generates embeddings for millions of items overnight, performs candidate generation through approximate nearest neighbor search over class prototypes, then re-ranks with a fine-tuned model. Throughput targets are 50,000 to 200,000 items per minute on a small cluster. Use mixed precision or 8-bit quantization to boost throughput by 2 to 4 times with minimal accuracy loss, typically under a 1 percent F1 drop.
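As a rough illustration, the sketch below wires a scikit-learn TF-IDF plus logistic-regression gate to a confidence threshold calibrated against a target expert routing rate. Here `expert_classify` is a hypothetical placeholder for the GPU-backed transformer, and quantile-based calibration is one simple way to hit a routing budget, not the only one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Cheap CPU gate: TF-IDF features + logistic regression.
gate = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

def calibrate_threshold(gate, val_texts, target_expert_rate=0.2):
    """Pick a confidence cutoff so that roughly `target_expert_rate`
    of validation traffic falls below it and routes to the expert."""
    conf = gate.predict_proba(val_texts).max(axis=1)
    return float(np.quantile(conf, target_expert_rate))

def classify(text, gate, threshold, expert_classify):
    """Route one item: the gate answers high-confidence cases,
    everything else goes to the expensive expert model."""
    proba = gate.predict_proba([text])[0]
    if proba.max() >= threshold:
        return gate.classes_[proba.argmax()], "gate"
    return expert_classify(text), "expert"  # e.g. distilled BERT on GPU

# Usage sketch (labels: 1 = spam, 0 = ham):
# gate.fit(train_texts, train_labels)
# threshold = calibrate_threshold(gate, val_texts, target_expert_rate=0.2)
# label, tier = classify("win a free prize now!!!", gate, threshold, expert_classify)
```

In practice the cutoff is usually tuned jointly against recall on the validation set rather than against the routing rate alone.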
💡 Key Takeaways
Fast gate handles 70 to 90 percent of traffic in under 1 millisecond on CPU and routes the 10 to 30 percent of uncertain cases to the GPU expert, cutting average serving cost by a factor of 3 to 5
Dynamic batching with batch sizes of 32 to 128 boosts GPU throughput by 2 to 5 times but adds 2 to 8 milliseconds of queuing delay, impacting P99 latency (see the sketch after this list)
Gmail spam architecture: Lexical gate filters obvious spam in under 1 millisecond, transformer expert processes 1,000 to 3,000 emails per second per GPU, maintains P99 under 60 milliseconds
Batch pipelines prioritize throughput over latency, achieving 50,000 to 200,000 items per minute, using mixed precision or 8 bit quantization for 2 to 4 times speedup
Tune gate thresholds on validation set to balance expert routing rate with system recall, monitor per tier metrics separately to diagnose accuracy versus cost trade-offs
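The dynamic-batching takeaway above can be made concrete with a minimal asyncio sketch: requests accumulate for a short window or until the batch fills, then a single batched call scores them all. `model_fn`, the 64-item batch cap, and the 5 ms window are illustrative assumptions, not any particular serving framework's API.

```python
import asyncio

class DynamicBatcher:
    """Collects requests for up to `window_ms`, then scores them as one
    batch; flushes early if `max_batch` items arrive first."""

    def __init__(self, model_fn, max_batch=64, window_ms=5.0):
        self.model_fn = model_fn          # batched forward pass, e.g. on GPU
        self.max_batch = max_batch
        self.window_ms = window_ms
        self.queue = asyncio.Queue()

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut                  # resolves once its batch is scored

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            item, fut = await self.queue.get()          # wait for first item
            batch, futures = [item], [fut]
            deadline = loop.time() + self.window_ms / 1000
            # Keep filling the batch until the window closes or it is full.
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), timeout)
                    batch.append(item)
                    futures.append(fut)
                except asyncio.TimeoutError:
                    break                 # timeout-based flush keeps P99 bounded
            # Run inference off the event loop so queuing is not blocked.
            results = await loop.run_in_executor(None, self.model_fn, batch)
            for fut, result in zip(futures, results):
                fut.set_result(result)

# Usage sketch:
# batcher = DynamicBatcher(model_fn=bert_expert_batch)   # hypothetical expert
# asyncio.create_task(batcher.run())
# label = await batcher.submit(email_text)
```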
📌 Examples
Gmail spam filter: TF-IDF gate rejects 80% of obvious spam in <1ms CPU, passes 20% to distilled BERT on GPU batched at 64 items, achieves 20ms median and 60ms P99 latency
Product categorization pipeline: Generate embeddings for 5M items overnight at 100K items/min, use approximate nearest neighbor search for candidate generation, re-rank with a fine-tuned model, total cost $200 per million items (a minimal pipeline sketch follows below)
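A minimal sketch of the product-categorization example, assuming placeholder `embed_fn`, `prototypes`, and `rerank_fn` for the encoder, class-prototype matrix, and fine-tuned re-ranker; the exact matrix product below stands in for the approximate nearest neighbor index a production pipeline would use at larger class counts.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def categorize_batch(items, embed_fn, prototypes, rerank_fn, top_k=10, batch_size=256):
    """Offline pipeline: embed items in batches, shortlist classes by
    cosine similarity to class prototypes, then re-rank the shortlist.
    `prototypes` is an (n_classes x dim) array of class embeddings."""
    protos = normalize(prototypes)
    labels = []
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        emb = normalize(embed_fn(chunk))                    # (B, dim), e.g. mixed precision on GPU
        sims = emb @ protos.T                               # cosine similarity to every class
        candidates = np.argsort(-sims, axis=1)[:, :top_k]   # top-k candidate classes per item
        labels.extend(rerank_fn(chunk, candidates))         # fine-tuned model picks the final class
    return labels
```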