
Production Tokenization: Performance, Caching, and Scale

Tokenization Latency

Tokenization happens on every request. At 10,000 QPS with 500-token average inputs, you tokenize 5 million tokens per second. A naive implementation at 1μs per token adds 500μs per request. Optimized implementations hit 10-50ns per token, adding only 5-25μs per request.

For large language model inference where generation takes 100-500ms, tokenization overhead is negligible. For embedding lookups at 5-10ms total latency, tokenization can be 5-10% of your budget. Profile before optimizing, but know the numbers.
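The arithmetic above is worth being able to reproduce on a whiteboard. A minimal sketch, using the figures from this section:

```python
# Back-of-envelope tokenization latency at the numbers quoted above.
QPS = 10_000          # requests per second
AVG_TOKENS = 500      # average tokens per request

tokens_per_second = QPS * AVG_TOKENS  # total tokenization load

def per_request_overhead_us(ns_per_token: float) -> float:
    """Tokenization overhead per request, in microseconds."""
    return AVG_TOKENS * ns_per_token / 1000  # ns -> us

naive_us = per_request_overhead_us(1000)  # naive: 1 us per token
fast_us = per_request_overhead_us(10)     # optimized: 10 ns per token

print(tokens_per_second)  # 5,000,000 tokens/s
print(naive_us, fast_us)  # 500 us vs 5 us per request
```

Against a 5ms embedding budget, 500μs is 10%; 5μs is noise.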

Implementation Choices

Rust tokenizers: The standard for production. 10-100× faster than Python implementations. Pre-compiled vocabularies and SIMD operations for parallel processing. Memory-mapped vocabulary files enable instant startup.

Vocabulary loading: A 50,000-token vocabulary with metadata is 5-10MB. Load once at startup, share across threads. Avoid reloading per-request: that adds 10-50ms latency and defeats all optimization.
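The load-once pattern can be sketched as a lazily initialized, thread-shared singleton. The file format and names here (`load_vocab`, `VOCAB_PATH`) are illustrative, not a specific library's API:

```python
import threading

VOCAB_PATH = "vocab.txt"  # hypothetical one-token-per-line vocabulary file

_vocab = None
_vocab_lock = threading.Lock()

def load_vocab(path: str) -> dict[str, int]:
    """Parse a vocabulary file into a token -> id mapping."""
    with open(path, encoding="utf-8") as f:
        return {line.strip(): i for i, line in enumerate(f)}

def get_vocab() -> dict[str, int]:
    """Return the shared vocabulary, loading it at most once per process."""
    global _vocab
    if _vocab is None:                 # fast path: no lock after first load
        with _vocab_lock:
            if _vocab is None:         # double-checked to avoid duplicate loads
                _vocab = load_vocab(VOCAB_PATH)
    return _vocab
```

Every request thread calls `get_vocab()` and pays the 10-50ms parse cost exactly once, at first use, instead of per request.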

Caching Tokenized Results

Cache tokenized sequences for repeated inputs. In embedding services, 20-40% of inputs are duplicates or near-duplicates. Hash the normalized input text and store the token IDs with a TTL. A cache hit skips tokenization and, potentially, model inference as well.

💡 Cache Design: Key = hash(normalized_text). Value = [token_ids, token_count, special_token_positions]. TTL = 1-24 hours depending on vocabulary update frequency. Size = 100K-1M entries for most services.
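The cache design above can be sketched with a TTL plus LRU eviction policy; the normalization rule here (lowercase, collapse whitespace) is an illustrative choice, and real services would match whatever normalization their tokenizer applies:

```python
import hashlib
import time
from collections import OrderedDict

class TokenCache:
    """TTL + LRU cache for tokenized inputs."""

    def __init__(self, max_entries: int = 100_000, ttl_seconds: float = 3600):
        self._store = OrderedDict()  # key -> (stored_at, token_ids)
        self.max_entries = max_entries
        self.ttl = ttl_seconds

    @staticmethod
    def _key(text: str) -> str:
        # Normalize before hashing so trivial variants hit the same entry.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str):
        key = self._key(text)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, token_ids = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]       # expired entry
            return None
        self._store.move_to_end(key)   # mark as recently used
        return token_ids

    def put(self, text: str, token_ids: list) -> None:
        key = self._key(text)
        self._store[key] = (time.monotonic(), token_ids)
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least-recently used
```

In production you would also track hit rate: at the 20-40% duplicate rates above, a working cache should show a comparable hit ratio.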

Batching for Throughput

Tokenize multiple inputs together. Padding to uniform length enables vectorized operations. With dynamic batching, collect inputs for 5-10ms windows, tokenize as a batch, then dispatch to model inference. Total throughput increases 2-5× over individual processing.

Padding trade-off: longer sequences mean more wasted compute. If one input has 50 tokens and another has 500, you pad the short one to 500. Sort inputs by approximate length before batching to minimize padding waste.
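Length-sorted batching can be sketched as below; `PAD_ID` and the batch size are illustrative choices:

```python
PAD_ID = 0  # illustrative padding token id

def make_batches(token_lists: list, batch_size: int) -> list:
    """Sort sequences by length, group into batches, pad each batch
    only to that batch's own longest sequence."""
    by_length = sorted(token_lists, key=len)
    batches = []
    for i in range(0, len(by_length), batch_size):
        group = by_length[i:i + batch_size]
        width = len(group[-1])  # longest sequence in this batch
        batches.append([seq + [PAD_ID] * (width - len(seq)) for seq in group])
    return batches
```

Without sorting, a 50-token input batched with a 500-token input wastes 450 padded positions; sorting keeps similar lengths together so each batch pads to its own max.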

💡 Key Takeaways
- Optimized tokenizers hit 10-50ns per token vs 1μs for naive implementations
- Rust tokenizers are 10-100× faster than Python, with SIMD and memory-mapped vocabularies
- Load the vocabulary once at startup and share it across threads; per-request loading adds 10-50ms
- Cache tokenized results: 20-40% of inputs are duplicates in embedding services
- Batch tokenization with dynamic batching increases throughput 2-5×; sort by length to minimize padding
📌 Interview Tips
1. Show the latency math: 500 tokens at 1μs = 500μs per request; at 10ns = 5μs per request
2. Explain caching: hash the normalized text, store token IDs with a 1-24 hour TTL
3. Mention the padding trade-off: sort by length before batching to minimize wasted compute