Natural Language Processing Systems • Text Classification at Scale
Zero Shot vs Supervised Classification Trade-offs
Zero shot classification maps text and label descriptions into the same embedding space, then uses cosine similarity to select labels without requiring labeled training data. You represent each label with one or more natural language descriptions, such as "This text is about product returns and refunds" for a returns category. At inference time, you compute similarities between the text embedding and all label embeddings, then select the top-k labels above a similarity threshold. This avoids the cost and delay of collecting thousands of labeled examples and accelerates iteration when labels change weekly or monthly.
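The scoring loop above can be sketched in a few lines of Python. This is a minimal illustration: the `embed` function here is a toy bag-of-words counter standing in for a real sentence embedding model (e.g. a sentence-transformers encoder), and the function names and threshold are illustrative, not a real API.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector as a stand-in for a real sentence
    embedding model; production systems use a trained encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_classify(text, label_descriptions, top_k=1, threshold=0.1):
    """Score text against every label description, return top-k above threshold."""
    text_vec = embed(text)
    scores = {label: cosine(text_vec, embed(desc))
              for label, desc in label_descriptions.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, score) for label, score in ranked[:top_k] if score >= threshold]

labels = {
    "returns": "this text is about product returns and refunds",
    "shipping": "this text is about shipping and delivery status",
}
print(zero_shot_classify("i want a refund for my returns", labels))
```

Swapping the toy `embed` for a real encoder leaves the selection logic unchanged, which is why label description wording matters so much: it is the only "training" input the system sees.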
Platforms like Snowflake expose zero shot classification directly within the data plane. You can pass a text and up to 100 category descriptions in a single call and receive structured outputs at thousands of items per minute without running a dedicated model serving stack. This simplifies downstream integration and reduces operational complexity, making it practical for analysts to classify support tickets or customer feedback without ML engineering resources.
The trade-off is accuracy on nuanced domain labels. Zero shot typically achieves F1 scores of 0.65 to 0.80 on specialized tasks, compared to 0.85 to 0.95 for supervised models fine tuned on task specific data. Label description engineering matters significantly. Poorly worded descriptions can drop accuracy by 10 to 15 percentage points. You also depend on the underlying embedding model's quality and language coverage, which may underperform on domain jargon or low resource languages.
Supervised classification with fine tuned BERT derivatives or sentence transformers delivers the best accuracy and calibration once you have labeled data. Training requires 1,000 to 10,000 examples per label depending on complexity, and retraining cycles run weekly to monthly as data drifts. Inference is cheaper than generative models, typically 10 to 50 milliseconds per item on GPU with dynamic batching, and easier to harden for compliance since you control the model weights and outputs. The downside is MLOps overhead: managing retraining pipelines, versioning, rollback, and model registries.
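The dynamic batching mentioned above is what keeps supervised inference in the 10 to 50 millisecond range: per-call overhead is amortized across a batch. A minimal sketch, assuming a stub model in place of a real fine tuned BERT classifier (the class and method names here are illustrative):

```python
from collections import deque

class DynamicBatcher:
    """Minimal dynamic-batching sketch: accumulate requests and flush a
    batch when it reaches max_size. A production batcher would also
    flush on a latency deadline, omitted here for brevity."""
    def __init__(self, model_fn, max_size=32):
        self.model_fn = model_fn   # runs inference on a list of texts
        self.max_size = max_size
        self.pending = deque()
        self.results = []

    def submit(self, text):
        self.pending.append(text)
        if len(self.pending) >= self.max_size:
            self.flush()

    def flush(self):
        if self.pending:
            batch = [self.pending.popleft() for _ in range(len(self.pending))]
            self.results.extend(self.model_fn(batch))

# Stub standing in for a fine tuned BERT classifier on GPU.
stub_model = lambda batch: [("label", 0.9) for _ in batch]

batcher = DynamicBatcher(stub_model, max_size=4)
for text in ["t1", "t2", "t3", "t4", "t5"]:
    batcher.submit(text)
batcher.flush()  # drain the remainder
print(len(batcher.results))  # 5 items classified across two batches
```

The batcher is one more piece of the MLOps surface the paragraph describes: it ships with the serving stack you own, alongside retraining pipelines and model registries.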
In practice, many production systems start with zero shot for bootstrapping and rapid experimentation, then transition to supervised models as labeled data accumulates and accuracy requirements tighten. Some systems use zero shot as an escalation path for rare or emerging labels that lack training data, routing high confidence items through supervised models and low confidence items through zero shot with human review.
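The confidence-based routing described above can be sketched as follows. The function and model names are hypothetical stand-ins, not a real API: `supervised_model` returns a label with a confidence score, and low-confidence items fall back to zero shot with a human-review flag.

```python
def route(item, supervised_model, zero_shot_model, confidence_threshold=0.8):
    """Hybrid routing sketch: trust the supervised model when it is
    confident, otherwise fall back to zero shot and flag for review."""
    label, confidence = supervised_model(item)
    if confidence >= confidence_threshold:
        return {"label": label, "source": "supervised", "needs_review": False}
    zs_label = zero_shot_model(item)
    return {"label": zs_label, "source": "zero_shot", "needs_review": True}

# Stub models standing in for real classifiers.
supervised = lambda text: ("returns", 0.92) if "refund" in text else ("other", 0.40)
zero_shot = lambda text: "emerging_topic"

print(route("please refund my order", supervised, zero_shot))
print(route("question about your new feature", supervised, zero_shot))
```

The threshold is a tuning knob: raising it sends more traffic to zero shot plus human review, trading throughput for precision on the supervised path.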
💡 Key Takeaways
• Zero shot achieves F1 0.65 to 0.80 without training data; supervised fine tuned models reach 0.85 to 0.95 with 1,000 to 10,000 labeled examples per label
• Snowflake accepts text and up to 100 category descriptions in a single call, returning structured outputs at thousands per minute without dedicated serving
• Label description quality impacts zero shot accuracy by 10 to 15 percentage points, requiring careful prompt engineering and testing
• Supervised models require MLOps for retraining pipelines, versioning, and rollback but deliver better calibration and compliance control
• Hybrid approach uses zero shot for rare or emerging labels and supervised models for high volume stable labels, optimizing accuracy and agility
📌 Examples
E-commerce returns classification: Start with zero shot using descriptions like "Customer wants refund" and "Customer reports damaged item", bootstrap to 1,000 labeled tickets per category, then fine tune BERT model to improve F1 from 0.72 to 0.89
Support ticket routing: Use Snowflake zero shot for initial deployment across 50 categories at 5,000 tickets per minute, collect feedback for 2 months, retrain supervised model for top 20 categories handling 80% of volume