
Online vs Offline NER Deployment Patterns

Online NER for Real-Time Systems

In online NER, extraction happens at request time. A user types a search query, the NER model processes it in 10-50ms, extracts entities, and those entities inform what happens next. Maybe the extracted location filters search results. Maybe the extracted product name routes to a specific catalog. The key constraint is latency: users cannot wait seconds for entity extraction.

This latency requirement shapes architectural choices. You need models small enough to run fast but accurate enough to be useful. You typically cannot use the largest, most accurate models because their inference time exceeds acceptable response latency. A model with 95% accuracy that takes 500ms is worse than one with 88% accuracy that takes 20ms if your latency budget is 100ms end-to-end.
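This trade-off can be made mechanical: pick the most accurate model whose tail latency fits inside the request budget. The sketch below uses the illustrative numbers from this section; the model names and the `pick_online_model` helper are hypothetical, not a real library API.

```python
# Hypothetical candidate models with the accuracy/latency figures from the text.
CANDIDATES = [
    {"name": "large-ensemble", "accuracy": 0.95, "p99_latency_ms": 500},
    {"name": "distilled-small", "accuracy": 0.88, "p99_latency_ms": 20},
]

def pick_online_model(candidates, latency_budget_ms):
    """Return the most accurate model whose p99 latency fits the budget."""
    viable = [m for m in candidates if m["p99_latency_ms"] <= latency_budget_ms]
    if not viable:
        raise ValueError("no candidate fits the latency budget")
    return max(viable, key=lambda m: m["accuracy"])

# With a 100 ms end-to-end budget, NER gets only a slice of it, say 50 ms:
model = pick_online_model(CANDIDATES, latency_budget_ms=50)
print(model["name"])  # distilled-small
```

The key point is that the budget passed in is the NER slice of the end-to-end latency, not the whole 100 ms, since retrieval and ranking also need time.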

Batch NER for Offline Processing

Batch NER processes documents offline, where per-document latency does not matter. You ingest a corpus of 10 million documents overnight, run NER on all of them, store the extracted entities in a database, and serve queries against the pre-extracted data. The extraction latency can be minutes per document because users never wait for it.
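A minimal sketch of this extract-and-store pattern, using SQLite as the entity store. The `extract_entities` function here is a trivial stand-in for a real (slow, high-accuracy) NER model; everything else is assumption for illustration.

```python
import sqlite3

def extract_entities(text):
    # Placeholder: a real batch pipeline would call a large model here.
    # This toy version just tags capitalized tokens.
    return [(tok, "ENT") for tok in text.split() if tok.istitle()]

def batch_ingest(docs, db_path=":memory:"):
    """Run NER over (doc_id, text) pairs and persist the entities."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities (doc_id TEXT, text TEXT, label TEXT)"
    )
    for doc_id, text in docs:
        for ent_text, label in extract_entities(text):
            conn.execute(
                "INSERT INTO entities VALUES (?, ?, ?)", (doc_id, ent_text, label)
            )
    conn.commit()
    return conn

conn = batch_ingest([("d1", "the Widget arrived in Paris")])
rows = conn.execute("SELECT text, label FROM entities").fetchall()
print(rows)  # [('Widget', 'ENT'), ('Paris', 'ENT')]
```

At query time the application reads from the `entities` table instead of running the model, which is why extraction latency never appears on the user's critical path.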

This latency flexibility enables different choices. You can use ensemble models that combine multiple NER systems and vote on results. You can run expensive post-processing to improve entity resolution. You can use the largest, most accurate models available because cost per document matters more than speed per document.
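Ensemble voting, which batch latency makes affordable, can be sketched as follows. The three "systems" are stand-in lambdas; a real pipeline would plug in distinct NER models.

```python
from collections import Counter

def ensemble_ner(text, systems, min_votes=2):
    """Keep (span, label) pairs proposed by at least min_votes systems."""
    votes = Counter()
    for system in systems:
        # set() so one system cannot vote twice for the same entity
        votes.update(set(system(text)))
    return sorted(ent for ent, n in votes.items() if n >= min_votes)

# Stand-ins for three independent NER models:
sys_a = lambda t: [("Paris", "LOC"), ("Acme", "ORG")]
sys_b = lambda t: [("Paris", "LOC")]
sys_c = lambda t: [("Paris", "GPE"), ("Acme", "ORG")]

print(ensemble_ner("...", [sys_a, sys_b, sys_c]))
# [('Acme', 'ORG'), ('Paris', 'LOC')]
```

Note that disagreement on the label counts as a separate vote: `("Paris", "GPE")` gets only one vote and is dropped, while `("Paris", "LOC")` survives with two. Running three models per document would be unacceptable online but is routine in batch.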

⚠️ Key Trade-off: Online NER requires smaller, faster models (10-50ms latency) with lower accuracy. Batch NER can use larger, slower models with higher accuracy. The choice depends on whether users wait for extraction or query pre-extracted data.

Hybrid Architectures

Many systems combine both approaches. Pre-extract entities from your document corpus using batch NER with high-accuracy models. When a user query arrives, run fast online NER on just the query text. Match query entities against the pre-extracted document entities. This way, you get high accuracy on the large corpus (batch) and acceptable latency on the short query (online).
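The hybrid lookup can be sketched in a few lines. Here `PRECOMPUTED` stands in for entities the batch pipeline already stored, and `fast_query_ner` for the small online model; both are illustrative assumptions.

```python
# Entities pre-extracted offline by a high-accuracy batch pipeline:
PRECOMPUTED = {
    "doc1": {"paris", "acme corp"},
    "doc2": {"berlin"},
}

def fast_query_ner(query):
    # Stand-in for a small, low-latency online model: a gazetteer lookup.
    known = {"paris", "berlin"}
    return {w for w in query.lower().split() if w in known}

def retrieve(query):
    """Match query entities against pre-extracted document entities."""
    q_ents = fast_query_ner(query)
    return sorted(doc for doc, ents in PRECOMPUTED.items() if q_ents & ents)

print(retrieve("hotels in Paris"))  # ['doc1']
```

The expensive model never sees the query, and the cheap model never sees the corpus, which is exactly how the architecture gets both accuracy and latency.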

💡 Key Takeaways
Online NER extracts entities at request time in 10-50ms, requiring smaller models that sacrifice accuracy for speed
Batch NER processes documents offline where latency does not matter, enabling ensemble models and expensive post-processing
A 95% accurate model at 500ms is worse than 88% at 20ms if your latency budget is 100ms end-to-end
Hybrid architectures batch-process the corpus with high-accuracy models and online-process queries with fast models
📌 Interview Tips
1. Frame online vs batch as a latency trade-off. Ask about the use case: do users wait for extraction, or query pre-extracted data?
2. Mention specific numbers: 10-50ms for online NER, minutes per document acceptable for batch. This shows you understand production constraints.
3. Describe the hybrid pattern: batch NER on documents, online NER on queries, match entities between them.