Production Implementation Patterns
Two Stage Retrieval and Ranking
This is the dominant pattern at scale. Stage one (retrieval) is batch or nearline: precompute embeddings, generate candidate sets using approximate nearest neighbors, write to storage. Stage two (ranking) is online: fetch top K candidates (K typically 100 to 1000), apply real-time filters, score with a compact model, return top N. Why split? Retrieval over billions of items is too expensive to do online, but it does not need per-request context. Ranking over hundreds of candidates is cheap enough for real-time and benefits hugely from fresh context like current session or device type. Concrete numbers: YouTube might have 1 billion videos. A daily batch job computes embeddings for all videos and the top 1000 candidates per user. Online ranking scores those 1000 candidates in 20 to 40 milliseconds using session features. Total latency: 5ms candidate fetch plus 30ms ranking equals 35ms, fitting easily in a 150ms page render budget.
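A minimal sketch of the online half of this pattern: fetch precomputed candidates, re-score them with fresh session context, and return the top N. All names here (fetch_candidates, rank, the in-memory CANDIDATE_STORE, the on_mobile boost) are illustrative stand-ins, not a real system's API.

```python
import heapq

# Stand-in for the batch stage's output: per-user candidate lists with
# batch-computed scores. In production this lives in a key-value store.
CANDIDATE_STORE = {
    "user_42": [("video_a", 0.9), ("video_b", 0.7), ("video_c", 0.4)],
}

def fetch_candidates(user_id, k=1000):
    """Stage one result: top-K candidates precomputed offline."""
    return CANDIDATE_STORE.get(user_id, [])[:k]

def rank(user_id, session_features, n=2):
    """Stage two: re-score candidates online using fresh context."""
    candidates = fetch_candidates(user_id)
    # Toy scoring: blend the batch score with a session-based boost.
    boost = 0.1 if session_features.get("on_mobile") else 0.0
    scored = [(batch_score + boost, item) for item, batch_score in candidates]
    return [item for _, item in heapq.nlargest(n, scored)]

print(rank("user_42", {"on_mobile": True}))  # top-2 items for the session
```

The expensive work (embedding all items, approximate nearest neighbor search) happens before the request arrives; the request path only scores a few hundred rows.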
Nearline Features: The Middle Ground
Some features need to be fresher than batch allows but do not require full real-time computation. Nearline means minute level freshness using streaming aggregations. Examples: purchases_last_hour, clicks_in_session, recent_searches.
Implementation: Kafka or Kinesis streams feed aggregation jobs that write to a low latency store every 1 to 5 minutes. Online serving reads these alongside batch features. This bridges the gap: batch provides long term signals (user demographics, historical preferences), nearline provides recent activity, online adds immediate context.
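A sketch of the nearline side, assuming a simple count-style aggregation: events arrive from a stream (Kafka or Kinesis in production; a plain list here), accumulate in a window, and flush to a low latency store on a timer. Class and key names are illustrative.

```python
from collections import defaultdict

class NearlineAggregator:
    """Accumulates stream events in memory and flushes to a serving store."""

    def __init__(self, store):
        self.store = store              # e.g. Redis in production; a dict here
        self.counts = defaultdict(int)  # the current aggregation window

    def on_event(self, user_id):
        """Called per stream event (e.g. a click)."""
        self.counts[user_id] += 1

    def flush(self):
        """Runs every 1 to 5 minutes: write the window, then reset it."""
        for user_id, n in self.counts.items():
            self.store[f"clicks_in_session:{user_id}"] = n
        self.counts.clear()

store = {}
agg = NearlineAggregator(store)
for uid in ["u1", "u1", "u2"]:   # simulated stream events
    agg.on_event(uid)
agg.flush()
print(store)  # {'clicks_in_session:u1': 2, 'clicks_in_session:u2': 1}
```

Online serving then reads these keys next to batch features, so the ranking model sees demographics (batch), recent activity (nearline), and request context (online) in one feature vector.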
Cascaded Models with Timeouts
Use a sequence of increasingly expensive models with early exits. First model is ultra fast (5ms), catches 80% of cases. Second model is more accurate but slower (30ms), handles edge cases. Each stage has a strict timeout; if exceeded, return the previous stage's result. Example in fraud detection: Stage one uses a simple rule based check (2ms). Stage two uses a tree ensemble (10ms). Stage three uses a neural network (40ms). Total budget is 50ms. If stage three times out, fall back to stage two prediction. This protects tail latency while allowing sophisticated models when time permits.
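One way to implement the cascade, sketched with `concurrent.futures`: each stage runs under its own timeout, and a timeout keeps the previous stage's result. The stage functions, their return values, and the timings are illustrative (the neural stage deliberately sleeps past its budget to force the fallback).

```python
import concurrent.futures
import time

def rules_stage(txn):
    return 0.1  # stand-in for a ~2ms rule-based check (instant here)

def tree_stage(txn):
    return 0.2  # stand-in for a ~10ms tree ensemble (instant here)

def neural_stage(txn):
    time.sleep(0.2)  # deliberately exceeds its budget in this sketch
    return 0.3

# (stage, per-stage timeout in seconds)
STAGES = [(rules_stage, 0.005), (tree_stage, 0.015), (neural_stage, 0.030)]

def predict(txn):
    result = None
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for stage, timeout in STAGES:
            future = pool.submit(stage, txn)
            try:
                result = future.result(timeout=timeout)
            except concurrent.futures.TimeoutError:
                break  # keep the previous (cheaper) stage's prediction
    return result

print(predict({"amount": 120}))  # neural stage times out -> tree result 0.2
```

Note one practical wrinkle this sketch glosses over: a Python thread that times out keeps running in the background, so real systems prefer cancellable work units or separate processes for the expensive stage.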
Admission Control and Load Shedding
When traffic spikes beyond capacity, you must shed load intelligently. Not all requests have equal value. Priority queue: critical paths (payment processing, safety checks) get capacity first. Lower priority paths (analytics, telemetry, non critical recommendations) get shed. Implementation: assign priority scores to request types. When queue depth exceeds threshold or latency breaches Service Level Objectives (SLOs), reject lowest priority requests with HTTP 503. This keeps the system serving high value traffic within SLOs rather than dragging everything into timeout land.
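The admission decision itself can be a few lines. This sketch assumes a static priority table and a fixed capacity threshold; the request types, priorities, and threshold are illustrative.

```python
# Lower number = higher priority. Unknown request types default to lowest.
PRIORITY = {"payment": 0, "safety": 0, "recommendation": 2, "telemetry": 3}
SHED_AT_OR_ABOVE = 2   # when overloaded, reject priority >= 2
CAPACITY = 100         # concurrent requests the service can sustain

def admit(request_type, in_flight):
    """Return an HTTP status: 200 to admit, 503 to shed."""
    overloaded = in_flight >= CAPACITY
    if overloaded and PRIORITY.get(request_type, 3) >= SHED_AT_OR_ABOVE:
        return 503  # shed: client should retry later
    return 200      # admit

print(admit("telemetry", in_flight=120))  # 503: shed under overload
print(admit("payment", in_flight=120))    # 200: critical path keeps capacity
```

In practice the overload signal is usually queue depth or a latency SLO breach rather than a raw in-flight count, but the shape is the same: classify, compare against the shed threshold, reject cheaply and early.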
Snapshot Semantics and Atomic Cutover
Batch jobs should never write directly to production tables. Write to versioned snapshots: predictions_v125. Validate the snapshot: check coverage (all expected entities present), sanity checks (no negative scores), A/B test if possible. Only then atomically switch the pointer: predictions_current now points to v125.
This enables instant rollback. Production metrics tank? Flip the pointer back to v124 in seconds. No data corruption, no partial states, no multi hour recovery.
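The validate-then-swap flow can be sketched as below, with dicts standing in for tables and the pointer. The validation checks mirror the ones above (coverage, no negative scores); names like predictions_v125 follow the text, everything else is illustrative.

```python
# Versioned snapshots written by the batch job (never production tables).
tables = {
    "predictions_v124": {"user_1": 0.8, "user_2": 0.6},
    "predictions_v125": {"user_1": 0.9, "user_2": 0.7},
}
pointer = {"predictions_current": "predictions_v124"}

def validate(snapshot, expected_entities):
    covered = expected_entities <= snapshot.keys()          # coverage check
    sane = all(score >= 0 for score in snapshot.values())   # sanity check
    return covered and sane

def cutover(version, expected_entities):
    """Atomically repoint 'current' only after the snapshot validates."""
    if validate(tables[version], expected_entities):
        pointer["predictions_current"] = version  # single atomic swap
        return True
    return False

cutover("predictions_v125", {"user_1", "user_2"})
print(pointer["predictions_current"])  # predictions_v125

# Instant rollback if production metrics tank: just repoint.
pointer["predictions_current"] = "predictions_v124"
```

Because readers only ever dereference the pointer, they see either the old snapshot or the new one in full, never a half-written mix.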
Capacity Planning for Real-time
Concurrency approximately equals queries per second (QPS) multiplied by latency (Little's law); using p99 latency rather than the mean gives a deliberately conservative estimate. At 15,000 QPS with 60ms p99, you need 15,000 multiplied by 0.06 equals 900 concurrent request slots. Each instance handles perhaps 50 concurrent requests, so you need 18 instances minimum. Add 30% headroom for bursts and failures: 24 instances. But this is steady state. Autoscaling lags 2 to 5 minutes. If traffic doubles in 30 seconds (common during events), you will violate SLOs until scale up completes. Solution: keep warm pools at 150% of predicted peak, or use predictive scaling that anticipates traffic patterns.
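The arithmetic above, as a small sizing helper. The inputs are the numbers from the text; the 30% headroom factor is the one this section recommends.

```python
import math

def instances_needed(qps, p99_s, per_instance_concurrency, headroom=0.30):
    """Steady-state instance count from QPS and p99 latency."""
    concurrency = qps * p99_s                              # ~ Little's law
    base = math.ceil(concurrency / per_instance_concurrency)
    return math.ceil(base * (1 + headroom))                # burst headroom

# 15,000 QPS * 0.060s = 900 slots; / 50 per instance = 18; * 1.3 -> 24
print(instances_needed(15_000, 0.060, 50))  # 24
```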