Production Implementation Patterns
Two Stage Retrieval and Ranking
This is the dominant pattern at scale. Stage one (retrieval) is batch or nearline: precompute embeddings, generate candidate sets using approximate nearest neighbors, write to storage. Stage two (ranking) is online: fetch top K candidates (K typically 100 to 1000), apply real-time filters, score with a compact model, return top N. Why split? Retrieval over billions of items is too expensive to do online, but it does not need per-request context. Ranking over hundreds of candidates is cheap enough for real-time and benefits hugely from fresh context like current session or device type. Concrete numbers: YouTube might have 1 billion videos. A daily batch job computes embeddings for all videos and the top 1000 candidates per user. Online ranking scores those 1000 candidates in 20 to 40 milliseconds using session features. Total latency: 5ms candidate fetch plus 30ms ranking equals 35ms, fitting easily in a 150ms page render budget.
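A minimal sketch of the online half of this pattern: fetch precomputed candidates, re-score them with fresh session context, and return the top N. All names here (fetch_candidates, rank, the in-memory CANDIDATE_STORE, the on_mobile boost) are illustrative stand-ins, not a real system's API.

```python
import heapq

# Stand-in for the batch stage's output: per-user candidate lists with
# batch-computed scores. In production this lives in a key-value store.
CANDIDATE_STORE = {
    "user_42": [("video_a", 0.9), ("video_b", 0.7), ("video_c", 0.4)],
}

def fetch_candidates(user_id, k=1000):
    """Stage one result: top-K candidates precomputed offline."""
    return CANDIDATE_STORE.get(user_id, [])[:k]

def rank(user_id, session_features, n=2):
    """Stage two: re-score candidates online using fresh context."""
    candidates = fetch_candidates(user_id)
    # Toy scoring: blend the batch score with a session-based boost.
    boost = 0.1 if session_features.get("on_mobile") else 0.0
    scored = [(batch_score + boost, item) for item, batch_score in candidates]
    return [item for _, item in heapq.nlargest(n, scored)]

print(rank("user_42", {"on_mobile": True}))  # top-2 items for the session
```

The expensive work (embedding all items, approximate nearest neighbor search) happens before the request arrives; the request path only scores a few hundred rows.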
Nearline Features: The Middle Ground
Some features need to be fresher than batch allows but do not require full real-time computation. Nearline means minute level freshness using streaming aggregations. Examples: purchases_last_hour, clicks_in_session, recent_searches.
Implementation: Kafka or Kinesis streams feed aggregation jobs that write to a low latency store every 1 to 5 minutes. Online serving reads these alongside batch features. This bridges the gap: batch provides long term signals (user demographics, historical preferences), nearline provides recent activity, online adds immediate context.
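A sketch of the nearline side, assuming a simple count-style aggregation: events arrive from a stream (Kafka or Kinesis in production; a plain list here), accumulate in a window, and flush to a low latency store on a timer. Class and key names are illustrative.

```python
from collections import defaultdict

class NearlineAggregator:
    """Accumulates stream events in memory and flushes to a serving store."""

    def __init__(self, store):
        self.store = store              # e.g. Redis in production; a dict here
        self.counts = defaultdict(int)  # the current aggregation window

    def on_event(self, user_id):
        """Called per stream event (e.g. a click)."""
        self.counts[user_id] += 1

    def flush(self):
        """Runs every 1 to 5 minutes: write the window, then reset it."""
        for user_id, n in self.counts.items():
            self.store[f"clicks_in_session:{user_id}"] = n
        self.counts.clear()

store = {}
agg = NearlineAggregator(store)
for uid in ["u1", "u1", "u2"]:   # simulated stream events
    agg.on_event(uid)
agg.flush()
print(store)  # {'clicks_in_session:u1': 2, 'clicks_in_session:u2': 1}
```

Online serving then reads these keys next to batch features, so the ranking model sees demographics (batch), recent activity (nearline), and request context (online) in one feature vector.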
Cascaded Models with Timeouts
Use a sequence of increasingly expensive models with early exits. First model is ultra fast (5ms), catches 80% of cases. Second model is more accurate but slower (30ms), handles edge cases. Each stage has a strict timeout; if exceeded, return the previous stage's result. Example in fraud detection: Stage one uses a simple rule based check (2ms). Stage two uses a tree ensemble (10ms). Stage three uses a neural network (40ms). Total budget is 50ms. If stage three times out, fall back to stage two prediction. This protects tail latency while allowing sophisticated models when time permits.
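One way to implement the cascade, sketched with `concurrent.futures`: each stage runs under its own timeout, and a timeout keeps the previous stage's result. The stage functions, their return values, and the timings are illustrative (the neural stage deliberately sleeps past its budget to force the fallback).

```python
import concurrent.futures
import time

def rules_stage(txn):
    return 0.1  # stand-in for a ~2ms rule-based check (instant here)

def tree_stage(txn):
    return 0.2  # stand-in for a ~10ms tree ensemble (instant here)

def neural_stage(txn):
    time.sleep(0.2)  # deliberately exceeds its budget in this sketch
    return 0.3

# (stage, per-stage timeout in seconds)
STAGES = [(rules_stage, 0.005), (tree_stage, 0.015), (neural_stage, 0.030)]

def predict(txn):
    result = None
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for stage, timeout in STAGES:
            future = pool.submit(stage, txn)
            try:
                result = future.result(timeout=timeout)
            except concurrent.futures.TimeoutError:
                break  # keep the previous (cheaper) stage's prediction
    return result

print(predict({"amount": 120}))  # neural stage times out -> tree result 0.2
```

Note one practical wrinkle this sketch glosses over: a Python thread that times out keeps running in the background, so real systems prefer cancellable work units or separate processes for the expensive stage.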
Admission Control and Load Shedding
When traffic spikes beyond capacity, you must shed load intelligently. Not all requests have equal value. Priority queue: critical paths (payment processing, safety checks) get capacity first. Lower priority paths (analytics, telemetry, non critical recommendations) get shed. Implementation: assign priority scores to request types. When queue depth exceeds threshold or latency breaches Service Level Objectives (SLOs), reject lowest priority requests with HTTP 503. This keeps the system serving high value traffic within SLOs rather than dragging everything into timeout land.
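The admission decision itself can be a few lines. This sketch assumes a static priority table and a fixed capacity threshold; the request types, priorities, and threshold are illustrative.

```python
# Lower number = higher priority. Unknown request types default to lowest.
PRIORITY = {"payment": 0, "safety": 0, "recommendation": 2, "telemetry": 3}
SHED_AT_OR_ABOVE = 2   # when overloaded, reject priority >= 2
CAPACITY = 100         # concurrent requests the service can sustain

def admit(request_type, in_flight):
    """Return an HTTP status: 200 to admit, 503 to shed."""
    overloaded = in_flight >= CAPACITY
    if overloaded and PRIORITY.get(request_type, 3) >= SHED_AT_OR_ABOVE:
        return 503  # shed: client should retry later
    return 200      # admit

print(admit("telemetry", in_flight=120))  # 503: shed under overload
print(admit("payment", in_flight=120))    # 200: critical path keeps capacity
```

In practice the overload signal is usually queue depth or a latency SLO breach rather than a raw in-flight count, but the shape is the same: classify, compare against the shed threshold, reject cheaply and early.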
Snapshot Semantics and Atomic Cutover
Batch jobs should never write directly to production tables. Write to versioned snapshots: predictions_v125. Validate the snapshot: check coverage (all expected entities present), sanity checks (no negative scores), A/B test if possible. Only then atomically switch the pointer: predictions_current now points to v125.
This enables instant rollback. Production metrics tank? Flip the pointer back to v124 in seconds. No data corruption, no partial states, no multi hour recovery.
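The validate-then-swap flow can be sketched as below, with dicts standing in for tables and the pointer. The validation checks mirror the ones above (coverage, no negative scores); names like predictions_v125 follow the text, everything else is illustrative.

```python
# Versioned snapshots written by the batch job (never production tables).
tables = {
    "predictions_v124": {"user_1": 0.8, "user_2": 0.6},
    "predictions_v125": {"user_1": 0.9, "user_2": 0.7},
}
pointer = {"predictions_current": "predictions_v124"}

def validate(snapshot, expected_entities):
    covered = expected_entities <= snapshot.keys()          # coverage check
    sane = all(score >= 0 for score in snapshot.values())   # sanity check
    return covered and sane

def cutover(version, expected_entities):
    """Atomically repoint 'current' only after the snapshot validates."""
    if validate(tables[version], expected_entities):
        pointer["predictions_current"] = version  # single atomic swap
        return True
    return False

cutover("predictions_v125", {"user_1", "user_2"})
print(pointer["predictions_current"])  # predictions_v125

# Instant rollback if production metrics tank: just repoint.
pointer["predictions_current"] = "predictions_v124"
```

Because readers only ever dereference the pointer, they see either the old snapshot or the new one in full, never a half-written mix.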
Capacity Planning for Real-time
Concurrency approximately equals queries per second (QPS) multiplied by latency (Little's law); using p99 latency rather than the mean gives a deliberately conservative estimate. At 15,000 QPS with 60ms p99, you need 15,000 multiplied by 0.06 equals 900 concurrent request slots. Each instance handles perhaps 50 concurrent requests, so you need 18 instances minimum. Add 30% headroom for bursts and failures: 24 instances. But this is steady state. Autoscaling lags 2 to 5 minutes. If traffic doubles in 30 seconds (common during events), you will violate SLOs until scale up completes. Solution: keep warm pools at 150% of predicted peak, or use predictive scaling that anticipates traffic patterns.
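The arithmetic above, as a small sizing helper. The inputs are the numbers from the text; the 30% headroom factor is the one this section recommends.

```python
import math

def instances_needed(qps, p99_s, per_instance_concurrency, headroom=0.30):
    """Steady-state instance count from QPS and p99 latency."""
    concurrency = qps * p99_s                              # ~ Little's law
    base = math.ceil(concurrency / per_instance_concurrency)
    return math.ceil(base * (1 + headroom))                # burst headroom

# 15,000 QPS * 0.060s = 900 slots; / 50 per instance = 18; * 1.3 -> 24
print(instances_needed(15_000, 0.060, 50))  # 24
```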