
Production NER Implementation: Training, Serving, and Monitoring

Creating Quality Training Data

NER model quality depends heavily on training data quality. Annotation guidelines must be unambiguous: annotators need clear rules for what counts as each entity type, how to handle edge cases, and when to mark something as uncertain. Without clear guidelines, different annotators make different decisions, and the model learns inconsistent patterns.

Measure inter-annotator agreement using Cohen's kappa. A score above 0.8 indicates annotators agree consistently. Below 0.7 suggests your guidelines are ambiguous. When annotators disagree, have a senior annotator adjudicate, but also use the disagreement data to improve guidelines. If two competent annotators disagree on whether "Apple" is an organization or should be left unannotated in a specific context, that context is genuinely ambiguous and your guidelines need to address it.
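As a concrete check, Cohen's kappa can be computed directly with scikit-learn. A minimal sketch, assuming both annotators produced token-level BIO tags over the same tokens (the label sequences here are illustrative):

```python
from sklearn.metrics import cohen_kappa_score

# Token-level BIO tags from two annotators over the same six tokens
# (illustrative data, not from a real corpus).
annotator_a = ["B-ORG", "O", "O", "B-PER", "I-PER", "O"]
annotator_b = ["B-ORG", "O", "O", "B-PER", "O", "O"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # above 0.8: consistent; below 0.7: revise guidelines
```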

Active Learning for Efficient Annotation

Annotating thousands of documents is expensive. Active learning reduces cost by selecting the most informative examples for annotation. The process: train an initial model on a small labeled set, use that model to find examples where it is most uncertain, annotate those uncertain examples, retrain, and repeat. This focuses annotation effort on the boundaries where the model struggles rather than examples it already handles well.

💡 Key Insight: Active learning can achieve the same model quality with 30-50% less annotation compared to random sampling. The savings come from avoiding annotation of examples the model already classifies correctly with high confidence.
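A minimal sketch of that loop using least-confidence sampling. Here train_model, predict_proba, and annotate are hypothetical stand-ins for your training routine, per-token probability output, and human annotation step:

```python
def doc_confidence(model, doc):
    # Score a document by its least-confident token prediction, so a single
    # uncertain span is enough to push the document to the front of the queue.
    token_probs = predict_proba(model, doc)  # hypothetical: per-token class probabilities
    return min(max(p) for p in token_probs)

def active_learning_loop(labeled, unlabeled, annotate, rounds=5, batch_size=100):
    for _ in range(rounds):
        model = train_model(labeled)  # hypothetical training routine
        # Least-confidence sampling: send the documents the model is most
        # unsure about to annotators; skip the ones it already handles well.
        ranked = sorted(unlabeled, key=lambda d: doc_confidence(model, d))
        batch, unlabeled = ranked[:batch_size], ranked[batch_size:]
        labeled.extend(annotate(doc) for doc in batch)
    return train_model(labeled)
```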

Scaling NER to Large Corpora

Processing billions of documents requires parallelization. Partition documents across workers, run NER independently on each partition, merge results. The process is embarrassingly parallel because documents are independent. The bottleneck shifts from compute to I/O: reading documents, writing extracted entities. Use batch processing frameworks that handle data shuffling efficiently.
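A sketch of the partition-process-merge pattern using Python's process pool; extract_entities stands in for whatever NER inference call you deploy, and the worker count is illustrative:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import chain

def process_partition(docs):
    # Workers run NER independently; no cross-partition state is needed
    # because documents are independent.
    return [extract_entities(doc) for doc in docs]  # hypothetical NER call

def run_parallel_ner(documents, n_workers=8):
    # Partition: split the corpus into one chunk per worker.
    size = -(-len(documents) // n_workers)  # ceiling division
    partitions = [documents[i:i + size] for i in range(0, len(documents), size)]
    # Process each partition in parallel, then merge by flattening.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(chain.from_iterable(pool.map(process_partition, partitions)))
```

The same shape maps directly onto batch frameworks (for example, per-partition processing in Spark), which also handle the I/O and shuffling that dominate at this scale.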

For very large scale, consider approximate methods: run expensive NER on a sample, use the results to train a smaller, faster model, deploy the fast model on the full corpus. You trade some accuracy for orders of magnitude speedup.
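A sketch of that sample-and-distill idea; big_model and train_fast_model are hypothetical stand-ins for the expensive tagger and the lightweight training routine:

```python
import random

def distill_ner(corpus, big_model, sample_size=100_000):
    # Run the expensive model on a random sample and treat its predictions
    # as "silver" labels for training a much faster student model.
    sample = random.sample(corpus, sample_size)
    silver_data = [(doc, big_model(doc)) for doc in sample]
    return train_fast_model(silver_data)  # hypothetical: e.g., a small CNN tagger

# fast_model = distill_ner(all_documents, expensive_ner)
# entities = [fast_model(doc) for doc in all_documents]  # full-corpus pass
```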

💡 Key Takeaways
Annotation guidelines must be unambiguous with clear edge case rules; measure inter-annotator agreement with Cohen's kappa (target 0.8+)
Active learning reduces annotation cost by 30-50% by focusing on examples where the model is uncertain rather than random sampling
Large-scale NER is embarrassingly parallel: partition documents across workers, run independently, merge results
For billion-document scale, train a fast model on expensive NER results from a sample, trading accuracy for orders of magnitude speedup
📌 Interview Tips
1. Mention Cohen's kappa as your annotation quality metric. Explain that below 0.7 indicates ambiguous guidelines needing revision.
2. Describe active learning as a cost-saving technique: find uncertain examples, annotate those, retrain. Quantify the 30-50% savings.
3. For scale questions, explain the partition-process-merge pattern and identify I/O as the bottleneck, not compute.