NER Model Architecture Trade-offs: Rules, CRFs, Transformers, and LLMs
Rule-Based NER
The simplest NER approach uses pattern matching: regular expressions for phone numbers and emails, lookup tables (gazetteers) for known entity names, hand-written rules for specific formats. A rule might say: any sequence of capitalized words followed by Inc, Corp, or LLC is likely an organization. This works surprisingly well for well-defined patterns and runs extremely fast, typically under 1 millisecond per document.
The limitation is coverage. Rules only catch patterns you anticipated: a new company name with unusual capitalization, a phone number in an unexpected format, or a person's name missing from your lookup table all slip through. Rule-based systems also require constant maintenance as new patterns emerge, and they cannot learn from data.
Statistical and Neural Models
Machine learning models learn entity patterns from labeled training data. Traditional approaches like Conditional Random Fields (CRFs) model the sequential nature of text, understanding that the word after "Mr." is likely a person name. Modern transformer-based models like BERT go further, using deep contextual understanding to distinguish "Apple the company" from "apple the fruit" based on surrounding words.
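To make the CRF idea concrete, here is a sketch of the kind of feature extraction a CRF-based tagger uses: each token is described by features of itself and its neighbors, so the model can learn sequential cues like "capitalized word after Mr. is probably a person." The feature set here is illustrative, not taken from any particular library.

```python
def token_features(tokens, i):
    """Features a CRF might use for token i (illustrative feature set)."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalization cue
        "word.isdigit": word.isdigit(),
        # Neighboring tokens carry the sequential context: a CRF can learn
        # that prev.word == "mr." strongly predicts the start of a PERSON.
        "prev.word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = "Mr. Smith joined Acme Corp in 2011".split()
print(token_features(tokens, 1))  # features for "Smith"
```

A CRF then learns weights over these features jointly with label transitions (e.g., I-PER rarely follows B-ORG), which is what distinguishes it from classifying each token independently.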
These models generalize to unseen entities. A neural model trained on one set of company names can recognize new companies it has never seen, based on contextual patterns. The trade-off is computational cost: transformer models require 10-100x more computation than rule-based systems, and they need substantial labeled training data (typically 10,000+ annotated examples for good performance).
Choosing an Approach
Start with rules for clearly structured entities. Use neural models when context matters for disambiguation. Measure precision and recall separately: rules typically have high precision but low recall, while neural models have more balanced metrics. Combine them when you need both.
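Measuring precision and recall separately is straightforward once predictions and gold annotations are represented as sets of (entity, label) pairs. The example data below is fabricated to illustrate the typical pattern: rules miss entities but rarely mislabel, while the neural model covers more but makes some errors.

```python
def precision_recall(predicted, gold):
    """Precision and recall over sets of (entity_text, label) pairs."""
    tp = len(predicted & gold)  # exact matches count as true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical evaluation data for one document.
gold = {("Acme Corp", "ORG"), ("Jane Doe", "PERSON"), ("Paris", "LOC")}
rule_preds = {("Acme Corp", "ORG")}                        # fewer, but correct
neural_preds = {("Acme Corp", "ORG"), ("Jane Doe", "PERSON"),
                ("paris", "LOC")}                          # broader, one miss

print(precision_recall(rule_preds, gold))    # high precision, low recall
print(precision_recall(neural_preds, gold))  # more balanced
```

A simple hybrid takes the union of both prediction sets, trading a little precision for the recall the rules lack; more careful combinations resolve overlapping spans by preferring whichever system is more precise for that entity type.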