Serving Text Classification at Scale: Batching, Caching, and Cost
Batching for Throughput
A GPU processes a batch at nearly the same cost as a single request: classifying 1 text takes about 15ms; classifying 32 texts takes about 18ms. That is roughly 27x more throughput for 20% more latency. At 10,000 requests per second, batching lets you serve with about 6 GPUs instead of roughly 150.
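The back-of-the-envelope math behind those figures, using the latencies above (illustrative numbers, not a benchmark):

```python
single_latency_s = 0.015            # 1 text per forward pass
batch_latency_s = 0.018             # 32 texts per forward pass

unbatched_rps = 1 / single_latency_s        # ~67 requests/s per GPU
batched_rps = 32 / batch_latency_s          # ~1,778 requests/s per GPU

target_rps = 10_000
print(target_rps / unbatched_rps)   # ~150 GPUs without batching
print(target_rps / batched_rps)     # ~6 GPUs with batching
```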
Implementation: Collect incoming requests into a queue. Every 5ms (or as soon as the batch reaches 32 items), run the whole batch through the model together; each original request then waits for its own result. Maximum added latency is 5ms; average added latency is 2.5ms.
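A minimal asyncio sketch of that micro-batching loop, under the assumption that inference happens in-process; `classify_batch` is a hypothetical stand-in for your model's batched call, and the constants mirror the numbers above:

```python
import asyncio

MAX_BATCH = 32
MAX_WAIT_MS = 5

queue: asyncio.Queue = asyncio.Queue()

def classify_batch(texts):
    # Placeholder for the real batched model call (assumption).
    return ["positive" for _ in texts]

async def classify(text):
    # Per-request entry point: enqueue the text and wait for its result.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut

async def batch_worker():
    while True:
        # Block until at least one request arrives, then hold the
        # batching window open before running the batch.
        text, fut = await queue.get()
        texts, futures = [text], [fut]
        await asyncio.sleep(MAX_WAIT_MS / 1000)
        while len(texts) < MAX_BATCH and not queue.empty():
            t, f = queue.get_nowait()
            texts.append(t)
            futures.append(f)
        # One forward pass for the whole batch; fan results back out.
        for f, label in zip(futures, classify_batch(texts)):
            f.set_result(label)
```

This simplified worker always waits the full window; a production batcher would also flush as soon as the batch fills and would run inference off the event loop.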
Trade-off: Batching adds tail latency. If your SLA is p99 under 20ms, a 5ms batching window might cause violations. For real-time applications, use smaller windows (1-2ms) with smaller batches (8-16).
Caching Repeated Classifications
Users submit the same or similar texts repeatedly. Support tickets often contain common phrases. Product reviews cluster around similar complaints. Caching classification results can eliminate 20-40% of compute.
Exact match caching: Hash the normalized input text. Store the label and confidence with a 1-hour TTL. Hit rate: 15-25% for high-volume systems.
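A minimal sketch of exact-match caching with an in-process dictionary; `classify_fn` stands in for the real model call and the normalization is deliberately simple:

```python
import hashlib
import time

CACHE_TTL_S = 3600  # 1-hour TTL
_cache = {}         # key -> (expires_at, label, confidence)

def _key(text: str) -> str:
    # Normalize (lowercase, collapse whitespace), then hash.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def cached_classify(text, classify_fn):
    """classify_fn(text) -> (label, confidence) is an assumed model call."""
    k = _key(text)
    now = time.time()
    hit = _cache.get(k)
    if hit and hit[0] > now:
        return hit[1], hit[2]               # cache hit
    label, confidence = classify_fn(text)   # cache miss: run the model
    _cache[k] = (now + CACHE_TTL_S, label, confidence)
    return label, confidence
```

In a multi-instance deployment the dictionary would typically be replaced by a shared store such as Redis.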
Semantic caching: Store embeddings of previously classified texts. For new input, find the nearest cached embedding. If similarity exceeds 0.95, return the cached label without running classification. Hit rate: 30-50% with a well-tuned similarity threshold.
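A brute-force NumPy sketch of that lookup; `embed_fn` is an assumed function returning a 1-D embedding vector, and a production system would use an approximate nearest-neighbor index (e.g. FAISS) rather than scanning every cached vector:

```python
import numpy as np

SIM_THRESHOLD = 0.95

class SemanticCache:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # assumed: text -> np.ndarray
        self.embeddings = []       # unit-normalized vectors
        self.labels = []

    def lookup(self, text):
        vec = self.embed_fn(text)
        vec = vec / np.linalg.norm(vec)
        if self.embeddings:
            # Cosine similarity against every cached embedding.
            sims = np.stack(self.embeddings) @ vec
            best = int(np.argmax(sims))
            if sims[best] >= SIM_THRESHOLD:
                return self.labels[best]   # cache hit: reuse the label
        return None                        # cache miss: caller runs the model

    def add(self, text, label):
        vec = self.embed_fn(text)
        self.embeddings.append(vec / np.linalg.norm(vec))
        self.labels.append(label)
```

Note that a semantic lookup still costs one embedding pass per request, so it only pays off when the classifier is meaningfully more expensive than the embedder.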
Cost Optimization Strategies
Model distillation: Train a smaller model (DistilBERT: 66M params) to mimic your large model (BERT-base: 110M params). Roughly 40% smaller and substantially faster at inference, for a 2-3% accuracy loss. For high-volume classification, this trade-off often makes sense.
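A sketch of the standard distillation loss in PyTorch, assuming you already have teacher and student logits for a batch; the temperature and mixing weight are illustrative defaults, and the commented usage assumes Hugging Face-style model outputs:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target (teacher) and hard-target (label) losses."""
    # Soft targets: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside the training loop (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(**batch).logits
# student_logits = student(**batch).logits
# loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
```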
Quantization: Convert model weights from 32-bit floats to 8-bit integers. 4x smaller model, 2-3x faster inference on supported hardware. Accuracy loss typically under 1%.
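A post-training dynamic quantization sketch using PyTorch; the checkpoint name is only an example, and realized speedups depend on int8 kernel support on your CPU:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Example checkpoint; substitute your own fine-tuned classifier.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

# Dynamic quantization: Linear layer weights become int8, and
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

Dynamic quantization needs no calibration data, which makes it the lowest-effort starting point before investing in static quantization or GPU int8 runtimes.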