
Online vs Offline NER Deployment Patterns

Production NER systems serve two fundamentally different workloads with distinct performance requirements. Offline extraction processes large corpora in batch to enrich indexes and knowledge graphs, optimizing for throughput and cost efficiency. Online inference powers real-time features like query understanding and content moderation, demanding strict latency guarantees and predictable tail performance.

In offline flows, web crawlers or content ingestion services write billions of documents to distributed storage. Batch pipelines read partitions, tokenize, run high-capacity transformer models, perform entity linking, and write structured output back to storage. At web scale, processing 1 trillion tokens with a pool of 200 GPUs, each sustaining 50 thousand tokens per second, completes in roughly 28 hours at full utilization. Teams maximize batch sizes (often 64 to 256 sequences) to saturate GPU memory and compute, trading latency for throughput. The output powers knowledge panels, structured search filters, and deduplication systems.

Online systems face severe latency budgets. Query understanding and moderation pipelines allocate 50 to 100 milliseconds total end to end, with NER consuming 2 to 10 milliseconds at the 95th percentile (p95). Teams deploy distilled or quantized models with batch size 1 for predictability, enabling dynamic batching of up to 8 requests with a 2 to 5 millisecond batching window during traffic spikes. A single CPU replica serves a few hundred requests per second on short inputs; a GPU replica handles a few thousand. Caching delivers large gains: repeated queries or messages achieve cache hit rates above 80 percent, reducing average latency below 2 milliseconds for hot keys.

The deployment architecture differs fundamentally. Offline systems run scheduled jobs on ephemeral compute, spinning up GPU clusters for batch runs and shutting them down between jobs to control cost. Online systems maintain always-on stateless microservices behind load balancers in multiple regions, auto-scaling on request rate and latency Service Level Objectives (SLOs). Shadow deployments and progressive rollouts protect against regressions. Fallback to rule-based extraction activates automatically if the model service degrades, preserving critical compliance features such as PII redaction with conservative precision even during outages.
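The 28-hour figure for the offline run is simple arithmetic over aggregate throughput. A minimal back-of-envelope sketch, using the illustrative numbers quoted above rather than measured benchmarks:

```python
# Back-of-envelope wall-clock estimate for the offline batch run.
# All figures are the illustrative numbers from the text, not benchmarks.
total_tokens = 1_000_000_000_000       # 1 trillion tokens in the corpus
gpus = 200                             # size of the batch GPU pool
tokens_per_sec_per_gpu = 50_000        # sustained per-device throughput

aggregate_tokens_per_sec = gpus * tokens_per_sec_per_gpu   # 10 million tokens/sec
wall_clock_hours = total_tokens / aggregate_tokens_per_sec / 3600
print(f"~{wall_clock_hours:.1f} hours at full utilization")  # ~27.8 hours
```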
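For the online path, dynamic batching and caching are the two levers that keep p95 inside the budget. The sketch below shows the pattern only: `run_ner_batch`, the 3 millisecond window, the batch cap of 8, and the naive exact-match cache are illustrative assumptions, not any particular serving framework's API.

```python
import asyncio
import time

MAX_BATCH = 8           # dynamic batching cap described above
BATCH_WINDOW_S = 0.003  # ~3 ms window, within the 2 to 5 ms range above

def run_ner_batch(texts):
    # Placeholder stand-in for a distilled/quantized NER model call on a batch.
    return [[("ENTITY", t)] for t in texts]

class MicroBatcher:
    def __init__(self):
        self.queue = asyncio.Queue()
        self.cache = {}  # naive exact-match cache; real systems bound it (LRU/TTL)

    async def infer(self, text):
        if text in self.cache:           # hot keys skip the model entirely
            return self.cache[text]
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        result = await fut
        self.cache[text] = result
        return result

    async def serve(self):
        # Collect requests into micro-batches: flush when full or when the window closes.
        while True:
            text, fut = await self.queue.get()
            batch = [(text, fut)]
            deadline = time.monotonic() + BATCH_WINDOW_S
            while len(batch) < MAX_BATCH:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            for (_, f), ents in zip(batch, run_ner_batch([t for t, _ in batch])):
                f.set_result(ents)
```

A server would start `serve()` as a background task (`asyncio.create_task(batcher.serve())`) and call `await batcher.infer(query)` per request: under load one model call is amortized across up to 8 requests, while cache hits return without touching the queue at all.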
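The degraded-mode fallback keeps only the compliance-critical subset of the task alive with rules. A minimal sketch of a conservative, regex-based PII redaction fallback; the patterns and names are illustrative assumptions, kept narrow to favor precision as described above:

```python
import re

# Conservative, high-precision patterns; a real rule set would be broader and audited.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
US_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_pii_fallback(text: str) -> str:
    """Rule-based redaction used only while the NER model service is degraded."""
    text = EMAIL.sub("[EMAIL]", text)
    text = US_PHONE.sub("[PHONE]", text)
    return text

print(redact_pii_fallback("Contact jane.doe@example.com or 415-555-0100."))
# -> "Contact [EMAIL] or [PHONE]."
```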
💡 Key Takeaways
Offline batch NER processes trillions of tokens over hours to days, using large batches (64 to 256 sequences) to saturate a pool of 200+ GPUs at an aggregate 10 million tokens per second
Online NER serves query understanding and moderation with p95 latency under 10 milliseconds, using batch size 1 for predictability and dynamic batching up to 8 during spikes
Caching delivers massive latency wins: repeated queries achieve above 80 percent hit rates, reducing average response time below 2 milliseconds for hot keys
Deployment patterns diverge: offline uses ephemeral scheduled jobs that spin down between runs; online maintains always-on stateless microservices with multi-region replication
Fallback to rule-based extraction activates during model service degradation, maintaining critical PII redaction with conservative precision even during outages
📌 Examples
Google Knowledge Graph ingestion: Offline batch NER over web crawls processes billions of documents with a 200-GPU pool, completing 1 trillion tokens in under 30 hours
Microsoft Bing query understanding: Online NER serves search queries with p95 latency under 5 milliseconds, using an in-memory cache with an 85% hit rate for frequent queries
Amazon product catalog enrichment: Nightly batch jobs extract attributes from 500 million product titles using large batch inference, updating structured facets by morning