Production Tokenization: Performance, Caching, and Scale
Tokenization Latency
Tokenization happens on every request. At 10,000 QPS with 500-token average inputs, you tokenize 5 million tokens per second. A naive implementation at 1μs per token adds 500μs per request. Optimized implementations hit 10-50ns per token, adding only 5-25μs per request.
For large language model inference where generation takes 100-500ms, tokenization overhead is negligible. For embedding lookups at 5-10ms total latency, tokenization can be 5-10% of your budget. Profile before optimizing, but know the numbers.
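The arithmetic above is easy to sanity-check. A quick back-of-envelope script, using the article's example figures (not measurements):

```python
# Back-of-envelope tokenization overhead, using the figures above.
QPS = 10_000
TOKENS_PER_REQUEST = 500

tokens_per_second = QPS * TOKENS_PER_REQUEST       # aggregate tokenization load
naive_overhead_us = TOKENS_PER_REQUEST * 1.0       # 1 us/token, naive implementation
fast_overhead_us = TOKENS_PER_REQUEST * 0.050      # 50 ns/token, optimized upper bound

print(tokens_per_second)   # 5000000 tokens/s across the fleet
print(naive_overhead_us)   # 500.0 us added per request
print(fast_overhead_us)    # 25.0 us added per request
```

Against a 100-500ms generation latency, even the naive 500μs is under 1%; against a 5ms embedding lookup, it is 10% of the budget, which is why the profiling advice matters.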
Implementation Choices
Rust tokenizers: The standard for production, 10-100× faster than pure-Python implementations. They use pre-compiled vocabularies and SIMD operations for parallel processing, and memory-mapped vocabulary files enable near-instant startup.
Vocabulary loading: A 50,000-token vocabulary with metadata is 5-10MB. Load it once at startup and share it across threads. Avoid reloading per request: that adds 10-50ms of latency and defeats every other optimization.
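The load-once pattern can be sketched with a cached loader that is warmed at startup and then shared by all serving threads. This is a minimal illustration with a tiny inline dict standing in for the real 5-10MB vocabulary file; the function names are hypothetical, not a specific library's API:

```python
import functools
import threading

LOAD_COUNT = 0  # instrumentation: proves the vocabulary loads only once

def _load_vocab():
    # Stand-in for reading a serialized vocabulary file from disk.
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"<unk>": 0, "hello": 1, "world": 2}

@functools.lru_cache(maxsize=1)
def get_vocab():
    # lru_cache memoizes the result: every caller after the first
    # gets the same cached dict, with no repeated disk I/O.
    return _load_vocab()

def tokenize(text):
    vocab = get_vocab()  # cheap lookup after warm-up; no per-request reload
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]

# Warm the cache at startup, BEFORE serving threads exist, so no two
# threads race to load the file concurrently.
get_vocab()

threads = [threading.Thread(target=tokenize, args=("hello world",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(LOAD_COUNT)                    # 1 -- one load, shared by all threads
print(tokenize("hello unknown world"))  # [1, 0, 2]
```

The key design point is warming the cache before concurrency begins; lazy loading under load risks several threads paying the 10-50ms cost simultaneously.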
Caching Tokenized Results
Cache tokenized sequences for repeated inputs. In embedding services, 20-40% of inputs are duplicates or near-duplicates. Hash the normalized input text and store the token IDs with a TTL. A cache hit skips tokenization, and if downstream results are cached alongside the token IDs, model inference as well.
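A minimal sketch of this cache, keyed by a hash of the normalized text with per-entry expiry (the class and its normalization rule are illustrative assumptions, not a specific library):

```python
import hashlib
import time

class TokenCache:
    """TTL cache mapping a hash of normalized input text to token IDs."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, token_ids)

    @staticmethod
    def _key(text):
        # Normalize (lowercase, collapse whitespace) before hashing so
        # trivially different inputs hit the same entry.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text):
        key = self._key(text)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, ids = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return ids

    def put(self, text, token_ids):
        self._store[self._key(text)] = (time.monotonic() + self.ttl, token_ids)

cache = TokenCache(ttl_seconds=60)
cache.put("Hello world", [1, 2])
print(cache.get("hello  WORLD"))  # [1, 2] -- normalization maps to the same key
print(cache.get("goodbye"))       # None -- miss: tokenize, then put()
```

How aggressive the normalization should be depends on the model: if casing changes token IDs, hash the raw text instead.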
Batching for Throughput
Tokenize multiple inputs together. Padding to uniform length enables vectorized operations. With dynamic batching, collect inputs for 5-10ms windows, tokenize as a batch, then dispatch to model inference. Total throughput increases 2-5× over individual processing.
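The collection step of dynamic batching can be sketched as a queue drain with a deadline: block for the first request, then keep accepting requests until the window closes or the batch fills. The function and its parameters are illustrative assumptions:

```python
import queue
import time

def collect_batch(q, max_batch=32, window_s=0.005):
    """Block for one request, then collect more until window_s elapses
    or max_batch is reached. Returns the batch for tokenization."""
    batch = [q.get()]  # block until at least one request arrives
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window closed
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # window closed while waiting
    return batch

q = queue.Queue()
for text in ["a b", "c d e", "f"]:
    q.put(text)

batch = collect_batch(q)
print(batch)  # ['a b', 'c d e', 'f'] -- all queued requests fit one window
```

In a real service this loop runs on a dispatcher thread that hands the batch to the tokenizer and then to model inference; the 5-10ms window trades a small latency hit for the 2-5× throughput gain.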
Padding trade-off: longer sequences mean more wasted compute. If one input has 50 tokens and another has 500, you pad the short one to 500. Sort inputs by approximate length before batching to minimize padding waste.
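The effect of length-sorting on padding waste can be shown with a small worked example (the lengths are invented for illustration):

```python
def padded_tokens(batch_lengths):
    # Every sequence in a batch is padded to the batch maximum,
    # so the batch consumes max_len * batch_size token slots.
    return max(batch_lengths) * len(batch_lengths)

def batches(lengths, size):
    return [lengths[i:i + size] for i in range(0, len(lengths), size)]

lengths = [500, 50, 480, 60, 40, 510]  # hypothetical input lengths

# Arrival order: short and long inputs share batches, so shorts pad to 500+.
naive = sum(padded_tokens(b) for b in batches(lengths, 2))
# Sorted order: similar lengths share batches, so padding is minimal.
bucketed = sum(padded_tokens(b) for b in batches(sorted(lengths), 2))

print(naive)     # 2980 token slots consumed
print(bucketed)  # 2080 token slots consumed -- ~30% less wasted compute
```

The useful payload is 1,640 tokens either way; sorting cuts the padding waste from 1,340 slots to 440 in this example. In practice you sort within a bounded buffer so responses are not reordered indefinitely.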