
Tiered Architecture for Latency and Cost Optimization

A tiered architecture uses a cheap gate to filter out easy cases, routing uncertain items to heavier expert models. This pattern dramatically reduces average latency and serving cost while maintaining high accuracy. The gate is typically a fast model, such as term frequency-inverse document frequency (TF-IDF) features with logistic regression, that runs on CPU in 10 microseconds to 1 millisecond. It handles 70 to 90 percent of traffic with high-confidence predictions, passing only the remaining 10 to 30 percent of uncertain cases to an expensive transformer-based expert running on GPU.

Consider Gmail-scale spam classification at millions of emails per minute. The lexical gate uses TF-IDF features with keyword rules to reject obvious spam in under 1 millisecond of CPU time per email, catching bulk spam via sender-reputation signals and known patterns. The remaining 10 to 30 percent goes to a distilled BERT encoder running on GPU with dynamic batching. A single modern GPU can embed 1,000 to 3,000 short emails per second at sequence lengths around 128 tokens with batch sizes of 32 to 128. Dynamic batching adds 2 to 8 milliseconds of queuing delay but lifts throughput by 2 to 5 times. The classifier head runs on CPU in under 1 millisecond. This keeps end-to-end median latency under 20 milliseconds and P99 under 60 milliseconds.

The key trade-off is accuracy versus cost. If the gate is too aggressive, false negatives slip through and overall recall drops. If the gate is too conservative, it routes too much traffic to the expensive expert and serving costs balloon. Tune gate thresholds on a validation set to hit a target expert routing rate, typically 10 to 30 percent, while maintaining acceptable recall. Monitor per-tier metrics separately: gate precision, gate recall, expert precision, expert recall, and blended system metrics.

Dynamic batching is critical for GPU utilization. Without batching, a GPU processes one item at a time and sits idle between requests, wasting 80 to 95 percent of its compute. Dynamic batching collects requests over a small time window, 2 to 10 milliseconds, then processes them together. Batch sizes of 32 to 128 achieve near-peak throughput. The trade-off is tail latency: items arriving at the start of the window wait longer. Apply a timeout-based flush to prevent excessive P99 spikes under low traffic, and separate latency-critical tenants onto dedicated instances.

For offline batch workloads like product categorization, the priority shifts from latency to throughput and cost per million items. A batch pipeline generates embeddings for millions of items overnight, performs candidate generation through approximate nearest neighbor search over class prototypes, then re-ranks with a fine-tuned model. Throughput targets are 50,000 to 200,000 items per minute on a small cluster. Use mixed precision or 8-bit quantization to boost throughput by 2 to 4 times with minimal accuracy loss, typically under a 1 percent F1 drop.
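As a rough illustration, the sketch below wires a scikit-learn TF-IDF plus logistic-regression gate to a confidence threshold calibrated against a target expert routing rate. Here `expert_classify` is a hypothetical placeholder for the GPU-backed transformer, and quantile-based calibration is one simple way to hit a routing budget, not the only one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Cheap CPU gate: TF-IDF features + logistic regression.
gate = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

def calibrate_threshold(gate, val_texts, target_expert_rate=0.2):
    """Pick a confidence cutoff so that roughly `target_expert_rate`
    of validation traffic falls below it and routes to the expert."""
    conf = gate.predict_proba(val_texts).max(axis=1)
    return float(np.quantile(conf, target_expert_rate))

def classify(text, gate, threshold, expert_classify):
    """Route one item: the gate answers high-confidence cases,
    everything else goes to the expensive expert model."""
    proba = gate.predict_proba([text])[0]
    if proba.max() >= threshold:
        return gate.classes_[proba.argmax()], "gate"
    return expert_classify(text), "expert"  # e.g. distilled BERT on GPU

# Usage sketch (labels: 1 = spam, 0 = ham):
# gate.fit(train_texts, train_labels)
# threshold = calibrate_threshold(gate, val_texts, target_expert_rate=0.2)
# label, tier = classify("win a free prize now!!!", gate, threshold, expert_classify)
```

In practice the cutoff is usually tuned jointly against recall on the validation set rather than against the routing rate alone.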
💡 Key Takeaways
Fast gate handles 70 to 90 percent of traffic in under 1 millisecond on CPU and routes the 10 to 30 percent of uncertain cases to the GPU expert, cutting average serving cost by a factor of 3 to 5
Dynamic batching with batch sizes of 32 to 128 boosts GPU throughput by 2 to 5 times but adds 2 to 8 milliseconds of queuing delay, impacting P99 latency (see the sketch after this list)
Gmail spam architecture: Lexical gate filters obvious spam in under 1 millisecond, transformer expert processes 1,000 to 3,000 emails per second per GPU, maintains P99 under 60 milliseconds
Batch pipelines prioritize throughput over latency, achieving 50,000 to 200,000 items per minute, using mixed precision or 8 bit quantization for 2 to 4 times speedup
Tune gate thresholds on validation set to balance expert routing rate with system recall, monitor per tier metrics separately to diagnose accuracy versus cost trade-offs
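The dynamic-batching takeaway above can be made concrete with a minimal asyncio sketch: requests accumulate for a short window or until the batch fills, then a single batched call scores them all. `model_fn`, the 64-item batch cap, and the 5 ms window are illustrative assumptions, not any particular serving framework's API.

```python
import asyncio

class DynamicBatcher:
    """Collects requests for up to `window_ms`, then scores them as one
    batch; flushes early if `max_batch` items arrive first."""

    def __init__(self, model_fn, max_batch=64, window_ms=5.0):
        self.model_fn = model_fn          # batched forward pass, e.g. on GPU
        self.max_batch = max_batch
        self.window_ms = window_ms
        self.queue = asyncio.Queue()

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut                  # resolves once its batch is scored

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            item, fut = await self.queue.get()          # wait for first item
            batch, futures = [item], [fut]
            deadline = loop.time() + self.window_ms / 1000
            # Keep filling the batch until the window closes or it is full.
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), timeout)
                    batch.append(item)
                    futures.append(fut)
                except asyncio.TimeoutError:
                    break                 # timeout-based flush keeps P99 bounded
            # Run inference off the event loop so queuing is not blocked.
            results = await loop.run_in_executor(None, self.model_fn, batch)
            for fut, result in zip(futures, results):
                fut.set_result(result)

# Usage sketch:
# batcher = DynamicBatcher(model_fn=bert_expert_batch)   # hypothetical expert
# asyncio.create_task(batcher.run())
# label = await batcher.submit(email_text)
```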
📌 Examples
Gmail spam filter: TF-IDF gate rejects 80% of obvious spam in <1ms CPU, passes 20% to distilled BERT on GPU batched at 64 items, achieves 20ms median and 60ms P99 latency
Product categorization pipeline: Generate embeddings for 5M items overnight at 100K items/min, use approximate nearest neighbor search for candidate generation, re-rank with a fine-tuned model, total cost $200 per million items (a minimal pipeline sketch follows below)
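A minimal sketch of the product-categorization example, assuming placeholder `embed_fn`, `prototypes`, and `rerank_fn` for the encoder, class-prototype matrix, and fine-tuned re-ranker; the exact matrix product below stands in for the approximate nearest neighbor index a production pipeline would use at larger class counts.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def categorize_batch(items, embed_fn, prototypes, rerank_fn, top_k=10, batch_size=256):
    """Offline pipeline: embed items in batches, shortlist classes by
    cosine similarity to class prototypes, then re-rank the shortlist.
    `prototypes` is an (n_classes x dim) array of class embeddings."""
    protos = normalize(prototypes)
    labels = []
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        emb = normalize(embed_fn(chunk))                    # (B, dim), e.g. mixed precision on GPU
        sims = emb @ protos.T                               # cosine similarity to every class
        candidates = np.argsort(-sims, axis=1)[:, :top_k]   # top-k candidate classes per item
        labels.extend(rerank_fn(chunk, candidates))         # fine-tuned model picks the final class
    return labels
```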