
Tiered Architecture for Latency and Cost Optimization

Key Insight
Tiered classification uses multiple models of increasing capability and cost. Fast cheap models handle easy cases; slow expensive models handle hard cases. Most requests never hit the expensive tier.

Why Tiers Work

Not all classification requests are equally difficult. "I want to cancel my subscription" is obviously a cancellation request. But "I am thinking about whether this makes sense for my budget" is ambiguous. A simple model classifies the first correctly in 2ms; the second needs a sophisticated model that takes 150ms.

The key insight: 70-80% of real production traffic is easy cases. If you can identify and route them to cheap models, you cut average latency and cost dramatically while maintaining accuracy on hard cases.

Three-Tier Architecture

Tier 1 (Rule-based, 0.5ms): Keyword matching and regex patterns. "Cancel subscription" triggers the cancellation label. Handles 30-40% of traffic with 99% precision on matched patterns. No ML cost.

Tier 2 (Lightweight ML, 5ms): Distilled BERT or logistic regression on TF-IDF. Handles 40-50% of traffic. 85-90% accuracy. Runs on CPU.

Tier 3 (Full model, 100-200ms): Large language model or fine-tuned transformer. Handles remaining 10-20% of ambiguous cases. 92-95% accuracy. Requires GPU.
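Tier 1 is the simplest to picture in code. Here is a minimal sketch of a rule-based matcher that returns a label plus a confidence score; the patterns, labels, and the 0.99 confidence value are illustrative placeholders, not from a real system:

```python
import re

# Hypothetical Tier 1 rules: (pattern, label) pairs. A real deployment
# would maintain many more patterns, mined from labeled traffic.
TIER1_PATTERNS = [
    (re.compile(r"\bcancel (my )?subscription\b", re.I), "cancellation"),
    (re.compile(r"\brefund\b", re.I), "refund_request"),
]

def tier1_classify(text):
    """Return (label, confidence), or (None, 0.0) if no rule fires."""
    for pattern, label in TIER1_PATTERNS:
        if pattern.search(text):
            # Matched patterns are treated as near-certain, reflecting
            # the ~99% precision claim for rule hits.
            return label, 0.99
    return None, 0.0
```

Unmatched requests fall through to Tier 2 rather than being forced into a label, which is what keeps rule precision high.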

Routing Logic

Each tier outputs a confidence score. If Tier 1 matches with confidence above 0.95, return immediately. Otherwise, pass to Tier 2. If Tier 2 confidence is below 0.80, escalate to Tier 3. These thresholds are tuned on validation data to balance accuracy vs cost.
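The cascade above can be sketched as a single routing function. The 0.95 and 0.80 thresholds come from the text; each tier is assumed to be a callable returning `(label, confidence)`, which is an interface choice of this sketch, not a prescribed API:

```python
TIER1_THRESHOLD = 0.95  # rule match must be near-certain to return early
TIER2_THRESHOLD = 0.80  # below this, escalate to the full model

def route(text, tier1, tier2, tier3):
    """Cascade through tiers; each tier is callable(text) -> (label, conf).

    Returns (label, tier_name) so cost/latency per tier can be tracked.
    """
    label, conf = tier1(text)
    if label is not None and conf >= TIER1_THRESHOLD:
        return label, "tier1"

    label, conf = tier2(text)
    if conf >= TIER2_THRESHOLD:
        return label, "tier2"

    # Tier 3 is the last resort: always trust its answer.
    label, _ = tier3(text)
    return label, "tier3"
```

Returning the tier name alongside the label makes it easy to monitor the traffic split per tier, which is the quantity the cost model below depends on.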

💡 Cost Math: 1M requests/day. Without tiers: all hit the GPU at $0.001 each = $1,000/day. With tiers: 35% rules ($0), 45% CPU ($0.0001 each), 20% GPU ($0.001 each) = $245/day. Same accuracy, roughly 75% cost reduction.
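The cost math works out as follows (per-request prices and the 35/45/20 traffic split are the figures used above):

```python
REQUESTS_PER_DAY = 1_000_000
RULE_COST, CPU_COST, GPU_COST = 0.0, 0.0001, 0.001  # $ per request

# Baseline: every request goes to the GPU tier.
no_tiers = REQUESTS_PER_DAY * GPU_COST  # $1,000/day

# Tiered: 35% resolved by rules, 45% by CPU model, 20% by GPU model.
tiered = REQUESTS_PER_DAY * (0.35 * RULE_COST
                             + 0.45 * CPU_COST
                             + 0.20 * GPU_COST)  # $245/day

savings = 1 - tiered / no_tiers  # ≈ 0.755, i.e. ~75% cheaper
```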
💡 Key Takeaways
70-80% of classification requests are easy cases that simple models handle correctly
Tier 1 uses rules/regex (0.5ms, 30-40% traffic), Tier 2 uses lightweight ML (5ms, 40-50%), Tier 3 uses full model (100ms, 10-20%)
Confidence thresholds control routing: high confidence returns early, low confidence escalates
Tiered approach can reduce costs by 70-80% while maintaining accuracy on hard cases
Tune thresholds on validation data to balance accuracy vs latency vs cost
📌 Interview Tips
1. Describe the three-tier pattern: rules for obvious cases, lightweight ML for medium, full model for ambiguous.
2. Show the cost math: 1M requests with tiers costs roughly 75% less than sending everything to the GPU.
3. Explain confidence-based routing: set thresholds based on validation accuracy at each tier.