Online vs Offline NER Deployment Patterns
Online NER for Real-Time Systems
In online NER, extraction happens at request time. A user types a search query, the NER model processes it in 10-50ms, extracts entities, and those entities inform what happens next. Maybe the extracted location filters search results. Maybe the extracted product name routes to a specific catalog. The key constraint is latency: users cannot wait seconds for entity extraction.
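This request-time flow can be sketched with a toy gazetteer lookup standing in for a real NER model; the entity names, routing targets, and `route` helper below are illustrative, not a real API:

```python
# Minimal sketch of request-time entity extraction steering a search query.
# The gazetteer and routing rules are toy stand-ins for a real NER model.

GAZETTEER = {
    "paris": ("LOCATION", "Paris"),
    "berlin": ("LOCATION", "Berlin"),
    "iphone": ("PRODUCT", "iPhone"),
}

def extract_entities(query: str) -> list[tuple[str, str]]:
    """Return (entity_type, canonical_name) pairs found in the query."""
    return [GAZETTEER[tok] for tok in query.lower().split() if tok in GAZETTEER]

def route(query: str) -> str:
    """Use the extracted entities to decide how to handle the request."""
    for etype, name in extract_entities(query):
        if etype == "LOCATION":
            return f"filter_results(location={name})"
        if etype == "PRODUCT":
            return f"product_catalog({name})"
    return "generic_search"

print(route("hotels in paris"))  # filter_results(location=Paris)
```

The point is structural: extraction and routing both sit on the request path, so every millisecond they spend counts against the user-facing latency budget.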
This latency requirement shapes architectural choices. You need models small enough to run fast but accurate enough to be useful. You typically cannot use the largest, most accurate models because their inference time exceeds acceptable response latency. A model with 95% accuracy that takes 500ms is worse than one with 88% accuracy that takes 20ms if your latency budget is 100ms end-to-end.
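The selection logic behind that tradeoff can be made explicit. The accuracy and latency numbers below are illustrative placeholders, not benchmarks of real models:

```python
# Sketch: pick the most accurate model whose latency fits the budget.
# Model names and (accuracy, p99 latency) figures are illustrative only.

CANDIDATES = [
    ("large-ensemble", 0.95, 500),    # (name, accuracy, p99 latency in ms)
    ("transformer-base", 0.92, 120),
    ("distilled-small", 0.88, 20),
]

def pick_model(latency_budget_ms: float) -> tuple[str, float, int]:
    """Return the highest-accuracy model within the latency budget."""
    viable = [m for m in CANDIDATES if m[2] <= latency_budget_ms]
    if not viable:
        raise ValueError("no model fits the latency budget")
    return max(viable, key=lambda m: m[1])

# With a 100ms end-to-end budget, the 88%-accurate model wins: the
# 95%-accurate model exists but simply cannot be served in time.
print(pick_model(100))  # ('distilled-small', 0.88, 20)
```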
Batch NER for Offline Processing
Batch NER processes documents when per-request latency does not matter. You ingest a corpus of 10 million documents overnight, run NER on all of them, store the extracted entities in a database, and serve queries against the pre-extracted data. Extraction can take minutes per document because users never wait for it.
This latency flexibility enables different choices. You can use ensemble models that combine multiple NER systems and vote on results. You can run expensive post-processing to improve entity resolution. You can use the largest, most accurate models available because the constraint is cost per document, not latency per document.
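The voting step can be sketched as span-level majority voting, one common way to combine NER systems (assumed here for illustration, not prescribed by the text); each system emits (start, end, label) spans over the same text:

```python
from collections import Counter

# Sketch of span-level majority voting across several NER systems.
# A span survives only if a strict majority of systems predicted it.

def majority_vote(predictions: list[set], threshold: float = 0.5) -> set:
    votes = Counter(span for pred in predictions for span in pred)
    needed = len(predictions) * threshold
    return {span for span, count in votes.items() if count > needed}

model_a = {(0, 4, "ORG"), (14, 20, "ORG")}
model_b = {(0, 4, "ORG"), (14, 20, "LOC")}
model_c = {(0, 4, "ORG"), (14, 20, "ORG")}

print(sorted(majority_vote([model_a, model_b, model_c])))
# [(0, 4, 'ORG'), (14, 20, 'ORG')]
```

Two of three systems agree the second span is an ORG, so the LOC vote is outvoted; running three models per document is affordable precisely because no user is waiting.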
Hybrid Architectures
Many systems combine both approaches. Pre-extract entities from your document corpus using batch NER with high-accuracy models. When a user query arrives, run fast online NER on just the query text. Match query entities against the pre-extracted document entities. This way, you get high accuracy on the large corpus (batch) and acceptable latency on the short query (online).
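Putting the pieces together, a minimal sketch of the hybrid pattern: the entity index is assumed to have been populated offline by batch NER, and only the short query passes through a fast online extractor (both extractors and all data are toy stand-ins):

```python
# Sketch of the hybrid pattern: corpus entities were extracted offline
# into an index; only the query is processed online, with a fast lookup
# standing in for a small NER model.

# Offline: high-accuracy batch NER already populated this index
# (entity -> document IDs).
entity_index = {
    ("ORG", "Acme"): [101, 205],
    ("LOC", "Berlin"): [101, 342],
}

def fast_query_ner(query: str) -> list[tuple[str, str]]:
    # Toy online extractor: look up known surface forms.
    known = {"acme": ("ORG", "Acme"), "berlin": ("LOC", "Berlin")}
    return [known[tok] for tok in query.lower().split() if tok in known]

def search(query: str) -> list[int]:
    """Intersect the document sets of every entity found in the query."""
    doc_sets = [set(entity_index.get(ent, [])) for ent in fast_query_ner(query)]
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))

print(search("acme offices in berlin"))  # [101]
```

The expensive model never sees the query and the fast model never sees the corpus, which is exactly how the accuracy and latency budgets are kept separate.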