Natural Language Processing Systems • Text Classification at Scale
What is Text Classification at Scale?
Text classification assigns one or more labels to a piece of text, like categorizing a support ticket as urgent, routing an email to spam, or tagging a product description with its category. At scale, production systems process millions to billions of items per day across multiple languages and formats.
Two broad model families dominate production deployments. Representation models map text to numeric vectors (embeddings of 768 to 1024 dimensions), then apply a lightweight classifier like logistic regression or a tree ensemble. This approach supports fast inference, typically 10 to 50 milliseconds per item, and makes retraining cheap, since you can update just the classifier head without recomputing embeddings. Generative models produce labels as text directly by conditioning on label descriptions, enabling zero-shot or few-shot classification without labeled training data, but at the cost of higher latency, often 50 to 300 milliseconds or more per item.
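As a hedged sketch of the representation-model path, the snippet below embeds a few toy support tickets and fits a logistic-regression head. It assumes the sentence-transformers and scikit-learn libraries; the encoder name all-mpnet-base-v2 (a 768-dimension model) and the tiny label set are illustrative choices, not from the text above.

```python
# Minimal representation-model sketch: embed once, train a cheap head.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-mpnet-base-v2")   # 768-dim embeddings

texts = ["refund was charged twice", "app crashes on login", "want to upgrade my plan"]
labels = ["billing", "support", "sales"]

# Embeddings can be computed once and cached; only the classifier
# head below needs retraining when labels or training data change.
X = encoder.encode(texts)
head = LogisticRegression(max_iter=1000).fit(X, labels)

new = encoder.encode(["I was billed for a plan I cancelled"])
print(head.predict(new))                             # e.g. ['billing']
```

Caching the embeddings is what makes retraining cheap in this family: only the small head is refit, while the expensive encoding step is reused.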
Production systems face three types of classification problems. Multi-class assigns exactly one label from a set, like routing a ticket to sales, support, or billing. Multi-label assigns multiple labels, such as tagging a news article with politics, economics, and technology. Hierarchical classification leverages taxonomy structure, predicting a coarse label first (electronics) and then refined labels (laptops, then gaming laptops), which reduces confusion and provides explainability.
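For the multi-label case, a common sketch is one binary classifier per label. The example below uses scikit-learn's MultiLabelBinarizer and OneVsRestClassifier over TF-IDF features as a stand-in for embeddings; the toy corpus and tags are purely illustrative.

```python
# Multi-label sketch: one binary classifier per label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = [
    "central bank raises rates amid election season",
    "chip maker unveils new AI accelerator",
    "parliament debates tech regulation bill",
]
tags = [{"politics", "economics"}, {"technology"}, {"politics", "technology"}]

vec = TfidfVectorizer().fit(docs)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)        # binary indicator matrix, one column per label

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(vec.transform(docs), Y)

pred = clf.predict(vec.transform(["parliament weighs new AI rules"]))
print(mlb.inverse_transform(pred))  # e.g. [('politics', 'technology')]
```

Hierarchical classification is typically assembled from the same pieces: a coarse classifier routes each item to a branch-specific fine classifier, so errors stay within the right subtree of the taxonomy.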
At companies like Google and Meta, text classification underpins critical products. Gmail publicly reports blocking over 99.9 percent of spam, with classification pipelines that handle millions of emails per minute at peak while maintaining latency budgets under 50 milliseconds at the 95th percentile. Social platforms run moderation classifiers inline at roughly 6,000 posts per second on average, with significantly higher spikes, to gate publishing or prioritize human review.
💡 Key Takeaways
• Representation models generate embeddings then classify, achieving 10 to 50 millisecond inference with F1 scores of 0.80 to 0.90 on common benchmarks
• Generative models produce labels as text, enabling zero-shot classification without training data but requiring 50 to 300 milliseconds or more per item
• Gmail spam classification handles millions of emails per minute, blocking over 99.9 percent of spam at under 50 milliseconds P95 latency
• Multi-class assigns one label, multi-label assigns multiple, and hierarchical classification leverages taxonomy structure to reduce confusion
• Production systems balance accuracy, latency, cost, and operational complexity across billions of daily classifications
📌 Examples
Gmail spam filter: Processes millions of emails per minute, blocks over 99.9% of spam, and maintains P95 latency under 50ms using a tiered architecture with a lexical gate and a transformer expert (sketched below)
Amazon product categorization: Batch pipeline classifies 50,000 to 200,000 items per minute, using embeddings with approximate nearest neighbor search for candidate generation
Twitter content moderation: Routes 6,000 posts per second on average, escalates 10 to 30 percent of items to heavy experts or human review, and targets under 100ms end-to-end latency
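The tiered pattern behind the Gmail and Twitter examples can be sketched as a confidence-gated cascade: a cheap first-stage model handles confident cases, and only uncertain items escalate to a slower expert model or to human review. Everything below is an illustrative stand-in; the models are toy callables returning a (label, confidence) pair, and the thresholds are not real production values.

```python
# Hedged sketch of a tiered gate-plus-expert classifier cascade.
def classify_tiered(text, cheap_model, expert_model,
                    gate_threshold=0.95, review_threshold=0.6):
    label, confidence = cheap_model(text)       # e.g. lexical / linear gate
    if confidence >= gate_threshold:
        return label, "gate"                    # fast path, most traffic
    label, confidence = expert_model(text)      # e.g. transformer expert
    if confidence >= review_threshold:
        return label, "expert"
    return label, "human_review"                # escalate ambiguous items

# Toy stand-ins: any callables returning (label, confidence) work.
cheap = lambda t: ("spam", 0.99) if "free money" in t else ("ham", 0.5)
expert = lambda t: ("ham", 0.9)
print(classify_tiered("free money now", cheap, expert))   # ('spam', 'gate')
print(classify_tiered("meeting at noon", cheap, expert))  # ('ham', 'expert')
```

The design choice is the usual latency-cost trade: the gate keeps average latency within budget by resolving most traffic cheaply, while the expert and human-review tiers absorb the ambiguous tail.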