Serving Text Classification at Scale: Batching, Caching, and Cost
Batching for Throughput
A GPU processes a batch at nearly the same cost as a single request: classifying 1 text takes about 15ms; classifying 32 texts takes about 18ms. That is roughly 27x more throughput for 20% more latency. At 10,000 requests per second, batching lets you serve with about 6 GPUs instead of roughly 150.
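The back-of-the-envelope math behind those figures, using the latencies above (illustrative numbers, not a benchmark):

```python
single_latency_s = 0.015            # 1 text per forward pass
batch_latency_s = 0.018             # 32 texts per forward pass

unbatched_rps = 1 / single_latency_s        # ~67 requests/s per GPU
batched_rps = 32 / batch_latency_s          # ~1,778 requests/s per GPU

target_rps = 10_000
print(target_rps / unbatched_rps)   # ~150 GPUs without batching
print(target_rps / batched_rps)     # ~6 GPUs with batching
```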
Implementation: Collect incoming requests into a queue. Every 5ms (or as soon as the batch reaches 32 items), run the whole batch through the model together; each original request then waits for its own result. Maximum added latency is 5ms; average added latency is 2.5ms.
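A minimal asyncio sketch of that micro-batching loop, under the assumption that inference happens in-process; `classify_batch` is a hypothetical stand-in for your model's batched call, and the constants mirror the numbers above:

```python
import asyncio

MAX_BATCH = 32
MAX_WAIT_MS = 5

queue: asyncio.Queue = asyncio.Queue()

def classify_batch(texts):
    # Placeholder for the real batched model call (assumption).
    return ["positive" for _ in texts]

async def classify(text):
    # Per-request entry point: enqueue the text and wait for its result.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut

async def batch_worker():
    while True:
        # Block until at least one request arrives, then hold the
        # batching window open before running the batch.
        text, fut = await queue.get()
        texts, futures = [text], [fut]
        await asyncio.sleep(MAX_WAIT_MS / 1000)
        while len(texts) < MAX_BATCH and not queue.empty():
            t, f = queue.get_nowait()
            texts.append(t)
            futures.append(f)
        # One forward pass for the whole batch; fan results back out.
        for f, label in zip(futures, classify_batch(texts)):
            f.set_result(label)
```

This simplified worker always waits the full window; a production batcher would also flush as soon as the batch fills and would run inference off the event loop.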
Trade-off: Batching adds tail latency. If your SLA is p99 under 20ms, a 5ms batching window might cause violations. For real-time applications, use smaller windows (1-2ms) with smaller batches (8-16).
Caching Repeated Classifications
Users submit the same or similar texts repeatedly. Support tickets often contain common phrases. Product reviews cluster around similar complaints. Caching classification results can eliminate 20-40% of compute.
Exact match caching: Hash the normalized input text. Store the label and confidence with a 1-hour TTL. Hit rate: 15-25% for high-volume systems.
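A minimal sketch of exact-match caching with an in-process dictionary; `classify_fn` stands in for the real model call and the normalization is deliberately simple:

```python
import hashlib
import time

CACHE_TTL_S = 3600  # 1-hour TTL
_cache = {}         # key -> (expires_at, label, confidence)

def _key(text: str) -> str:
    # Normalize (lowercase, collapse whitespace), then hash.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def cached_classify(text, classify_fn):
    """classify_fn(text) -> (label, confidence) is an assumed model call."""
    k = _key(text)
    now = time.time()
    hit = _cache.get(k)
    if hit and hit[0] > now:
        return hit[1], hit[2]               # cache hit
    label, confidence = classify_fn(text)   # cache miss: run the model
    _cache[k] = (now + CACHE_TTL_S, label, confidence)
    return label, confidence
```

In a multi-instance deployment the dictionary would typically be replaced by a shared store such as Redis.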
Semantic caching: Store embeddings of previously classified texts. For new input, find the nearest cached embedding. If similarity exceeds 0.95, return the cached label without running classification. Hit rate: 30-50% with a well-tuned similarity threshold.
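A brute-force NumPy sketch of that lookup; `embed_fn` is an assumed function returning a 1-D embedding vector, and a production system would use an approximate nearest-neighbor index (e.g. FAISS) rather than scanning every cached vector:

```python
import numpy as np

SIM_THRESHOLD = 0.95

class SemanticCache:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # assumed: text -> np.ndarray
        self.embeddings = []       # unit-normalized vectors
        self.labels = []

    def lookup(self, text):
        vec = self.embed_fn(text)
        vec = vec / np.linalg.norm(vec)
        if self.embeddings:
            # Cosine similarity against every cached embedding.
            sims = np.stack(self.embeddings) @ vec
            best = int(np.argmax(sims))
            if sims[best] >= SIM_THRESHOLD:
                return self.labels[best]   # cache hit: reuse the label
        return None                        # cache miss: caller runs the model

    def add(self, text, label):
        vec = self.embed_fn(text)
        self.embeddings.append(vec / np.linalg.norm(vec))
        self.labels.append(label)
```

Note that a semantic lookup still costs one embedding pass per request, so it only pays off when the classifier is meaningfully more expensive than the embedder.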
Cost Optimization Strategies
Model distillation: Train a smaller model (DistilBERT: 66M params) to mimic your large model (BERT-base: 110M params). Roughly 40% smaller and substantially faster at inference, for a 2-3% accuracy loss. For high-volume classification, this trade-off often makes sense.
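A sketch of the standard distillation loss in PyTorch, assuming you already have teacher and student logits for a batch; the temperature and mixing weight are illustrative defaults, and the commented usage assumes Hugging Face-style model outputs:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target (teacher) and hard-target (label) losses."""
    # Soft targets: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside the training loop (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(**batch).logits
# student_logits = student(**batch).logits
# loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
```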
Quantization: Convert model weights from 32-bit floats to 8-bit integers. 4x smaller model, 2-3x faster inference on supported hardware. Accuracy loss typically under 1%.
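A post-training dynamic quantization sketch using PyTorch; the checkpoint name is only an example, and realized speedups depend on int8 kernel support on your CPU:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Example checkpoint; substitute your own fine-tuned classifier.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

# Dynamic quantization: Linear layer weights become int8, and
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

Dynamic quantization needs no calibration data, which makes it the lowest-effort starting point before investing in static quantization or GPU int8 runtimes.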