Production Tokenization: Performance, Caching, and Scale
In production, tokenization must meet strict latency and throughput requirements while maintaining consistency. For online inference, budget tokenization to less than 5 percent of end to end latency. If your target is 200 milliseconds P50 first token latency, keep tokenization under 10 milliseconds P50 and 30 milliseconds P99. For batch indexing, target at least 100,000 tokens per second per CPU core for subword methods.
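To make the budget concrete, here is a minimal sketch of a guard around the tokenizer call, assuming only a generic tokenize(text) callable that returns token ids; the thresholds mirror the numbers above and the logging hook is illustrative.

```python
import logging
import time

# Budgets derived from a 200ms P50 first token target (tokenization <= ~5 percent).
TOKENIZE_BUDGET_P50_MS = 10.0
TOKENIZE_BUDGET_P99_MS = 30.0

logger = logging.getLogger("tokenization")

def timed_tokenize(tokenize, text):
    """Run `tokenize(text)` and warn when a single call blows the P99 budget.

    `tokenize` is any callable str -> list[int]; this is not tied to a
    specific tokenizer library.
    """
    start = time.perf_counter()
    token_ids = tokenize(text)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if elapsed_ms > TOKENIZE_BUDGET_P99_MS:
        logger.warning(
            "tokenization took %.1fms for %d chars (P99 budget %.0fms)",
            elapsed_ms, len(text), TOKENIZE_BUDGET_P99_MS,
        )
    return token_ids, elapsed_ms
```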
Consider a chat system serving 1,000 requests per second with GPT style byte level BPE. At a typical 1.5KB prompt, that is only about 1.5 million characters per second, roughly 375,000 tokens per second at 4 characters per token, or about 4 CPU cores at 100K tokens per second per core. But once conversation history and retrieved context push prompts to 60 to 80KB, the same traffic becomes 60 to 80 million characters per second, 15 to 20 million tokens per second, and 150 to 200 cores just for tokenization. This motivates aggressive optimization: native language implementations (Rust or C++), memory mapped read only vocabulary structures, batching within and across requests, and content hash based prompt caching.
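A minimal sketch of content hash based prompt caching follows, again assuming a deterministic tokenize(text) callable; the SHA-256 key, LRU eviction, and cache size are illustrative choices, not a specific library's behavior.

```python
import hashlib
from collections import OrderedDict

class TokenCache:
    """LRU cache from a content hash of the text to its token ids.

    Only valid while the tokenizer and vocabulary are fixed; bump `version`
    whenever either changes so stale token ids can never be served.
    """

    def __init__(self, tokenize, max_entries=100_000, version="v1"):
        self.tokenize = tokenize            # assumed callable: str -> list[int]
        self.max_entries = max_entries
        self.version = version
        self._cache = OrderedDict()

    def _key(self, text):
        h = hashlib.sha256()
        h.update(self.version.encode("utf-8"))
        h.update(text.encode("utf-8"))
        return h.hexdigest()

    def encode(self, text):
        key = self._key(text)
        if key in self._cache:
            self._cache.move_to_end(key)      # refresh LRU position on hit
            return self._cache[key]
        token_ids = self.tokenize(text)
        self._cache[key] = token_ids
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)   # evict least recently used
        return token_ids
```

The biggest wins usually come from keying on the stable prefix (system message and template) rather than the full prompt, since user turns rarely repeat verbatim.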
Performance cliffs are real. Tokenization that normally takes 1 millisecond can jump to 50 milliseconds on pathological inputs like long numbers, base64 blobs, or repeated character patterns that defeat BPE merge caching. A 20KB base64 encoded image in a prompt can blow the P99 latency budget on its own. Mitigation strategies include early content classification, refusing or rewriting pathological inputs (summarize before tokenize), and setting hard limits on raw input size before tokenization begins.
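A sketch of early content classification and input sanitization before tokenization; the regexes, run lengths, and size cap below are illustrative and should be tuned to real traffic.

```python
import re

MAX_INPUT_BYTES = 64 * 1024   # hard cap on raw input size before tokenizing (illustrative)

BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{512,}={0,2}")  # long base64-like runs
DIGIT_RUN = re.compile(r"\d{256,}")                     # very long numbers
REPEAT_RUN = re.compile(r"(.)\1{255,}")                 # one character repeated 256+ times

def classify_input(text: str) -> str:
    """Return a coarse label so callers can reject, truncate, or rewrite
    pathological inputs before they ever reach the tokenizer."""
    if len(text.encode("utf-8")) > MAX_INPUT_BYTES:
        return "too_large"
    if BASE64_RUN.search(text):
        return "base64_blob"
    if DIGIT_RUN.search(text) or REPEAT_RUN.search(text):
        return "pathological_run"
    return "ok"

def truncate_blobs(text: str, keep_chars: int = 1024) -> str:
    """Replace long base64-like runs with a truncated form plus a placeholder,
    the mitigation described above."""
    return BASE64_RUN.sub(
        lambda m: m.group(0)[:keep_chars] + "...[binary blob truncated]", text
    )
```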
Chunking and streaming introduce correctness risks. Splitting a large document into chunks for parallel tokenization can cut across subword boundaries, yielding different merges and mismatched token sequences. Use boundary aware chunking with small overlap (8 to 16 bytes) and stitch results using the tokenizer's merge rules. For 100MB documents, process in 1MB chunks with 16 byte overlaps, discard overlap tokens from all but the first chunk, and validate total count matches single pass tokenization on a sample.
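Below is a simplified sketch of boundary aware parallel tokenization. Rather than implementing the overlap and stitch step, it backs each cut up to the previous newline, which is assumed to be a safe pretoken boundary for the tokenizer in use, and validates a sample against single pass tokenization; tokenize is any deterministic, picklable callable returning token ids.

```python
from concurrent.futures import ProcessPoolExecutor

CHUNK_BYTES = 1 * 1024 * 1024   # ~1MB chunks, as discussed above

def split_on_boundaries(data: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Yield chunks cut at newline boundaries so no pretoken spans two chunks.

    If a chunk contains no newline at all, fall back to a hard cut and accept
    possible drift there (rare for natural text)."""
    start = 0
    while start < len(data):
        end = min(start + chunk_bytes, len(data))
        if end < len(data):
            nl = data.rfind(b"\n", start, end)
            if nl > start:
                end = nl + 1                  # keep the newline with the earlier chunk
        yield data[start:end]
        start = end

def tokenize_document(data: bytes, tokenize, workers: int = 8):
    """Tokenize chunks in parallel and concatenate results in document order."""
    chunks = [c.decode("utf-8", errors="replace") for c in split_on_boundaries(data)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(tokenize, chunks))
    return [tok for chunk_tokens in per_chunk for tok in chunk_tokens]

def validate_against_single_pass(data: bytes, tokenize, sample_bytes: int = 4 * 1024 * 1024):
    """Spot check: chunked tokenization of a prefix must match a single pass."""
    sample = data[:sample_bytes]
    nl = sample.rfind(b"\n")
    sample = sample[: nl + 1] if nl > 0 else sample
    chunked = tokenize_document(sample, tokenize, workers=2)
    single = tokenize(sample.decode("utf-8", errors="replace"))
    return chunked == single
```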
💡 Key Takeaways
•Budget tokenization latency strictly. For 200ms P50 inference, keep tokenization under 10ms P50 and 30ms P99. At 1,000 QPS, core count scales with prompt size: 1.5KB prompts need only a handful of cores, while 60 to 80KB long context prompts mean 15 to 20 million tokens per second and 150 to 200 cores.
•Pathological inputs cause performance cliffs. Base64 blobs, long numbers, or repeated patterns can push a 1ms tokenization to 50ms. Mitigate with early content classification, hard input size limits, and summarization before tokenization.
•Content hash caching for repeated prompts can reduce compute by 40 to 60 percent in production systems with common system messages and templates. Store token ids alongside raw text for immutable corpora.
•Parallel chunking requires boundary aware splitting. Cutting across subword boundaries produces different tokens. Use 8 to 16 byte overlaps, discard overlap from all but first chunk, and validate against single pass tokenization on samples.
•Observability is critical. Emit tokens per second, tokens per request, time per kilobyte, and track P50 and P99 separately. Alert on drift and unknown token rates for word level systems. A minimal metrics sketch follows below.
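The sketch below aggregates the per request metrics called out in the last takeaway, using only the standard library; the metric names and the in process aggregation are illustrative, and in production these would be emitted as counters and histograms to your metrics pipeline.

```python
from statistics import quantiles

class TokenizationMetrics:
    """Aggregates per request tokenization timings into the rates and
    percentiles worth tracking: tokens/s, tokens/request, ms/KB, P50, P99."""

    def __init__(self):
        self.samples = []   # (elapsed_seconds, num_tokens, num_bytes)

    def record(self, elapsed_s: float, num_tokens: int, num_bytes: int):
        self.samples.append((elapsed_s, num_tokens, num_bytes))

    def snapshot(self) -> dict:
        if not self.samples:
            return {}
        times = sorted(s[0] for s in self.samples)
        total_time = sum(times)
        total_tokens = sum(s[1] for s in self.samples)
        total_kb = sum(s[2] for s in self.samples) / 1024.0
        if len(times) >= 2:
            cuts = quantiles(times, n=100)    # cuts[49] ~ P50, cuts[98] ~ P99
            p50, p99 = cuts[49], cuts[98]
        else:
            p50 = p99 = times[0]
        return {
            "tokens_per_second": total_tokens / total_time if total_time else 0.0,
            "tokens_per_request": total_tokens / len(self.samples),
            "ms_per_kb": (total_time * 1000.0) / total_kb if total_kb else 0.0,
            "p50_ms": p50 * 1000.0,
            "p99_ms": p99 * 1000.0,
        }
```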
📌 Examples
OpenAI ships tiktoken, a fast tokenizer in Rust with Python bindings, achieving 500K+ tokens per second per core for typical prompts. Azure OpenAI mirrors this to ensure consistent billing and context limit enforcement.
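Throughput numbers like these are easy to sanity check. A rough sketch (not an official benchmark) assuming the tiktoken package is installed, using its public get_encoding and encode calls:

```python
import time
import tiktoken

def measure_tokens_per_second(text: str, encoding_name: str = "cl100k_base",
                              repeats: int = 100) -> float:
    """Approximate tokenizer throughput by repeatedly encoding `text`."""
    enc = tiktoken.get_encoding(encoding_name)
    enc.encode(text)                          # warm up (lazy initialization, CPU caches)
    start = time.perf_counter()
    total_tokens = 0
    for _ in range(repeats):
        total_tokens += len(enc.encode(text))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

if __name__ == "__main__":
    sample = "Production tokenization must meet strict latency budgets. " * 50
    print(f"~{measure_tokens_per_second(sample):,.0f} tokens/sec on this machine")
```

Repeating the same text keeps data hot in CPU caches, so treat the result as an upper bound and measure on representative prompts for capacity planning.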
A search indexing system processing 100K documents per minute at 2KB each pushes roughly 3.3 million characters per second through the tokenizer, or about 830K tokens per second at 4 characters per token. With 100K tokens per second per core, about 9 cores handle the load. Adding content hash caching for 30 percent duplicate documents reduces this to roughly 6 cores.
A chat application hit P99 latency spikes when users pasted code with long base64 strings. Tokenization jumped from 5ms to 80ms. Fix: Detect base64 patterns, truncate to 1KB with a placeholder, reducing P99 to 12ms and improving user experience.