Batch Inference: Throughput Over Latency
The Batch Philosophy
Batch inference says "I will wait to accumulate work, then blast through it all at once using massive parallelism." You are not optimizing for how fast one prediction completes. You are optimizing for how many predictions per dollar and how efficiently you use compute.
The Execution Model
Batch jobs are embarrassingly parallel and bursty. You spin up thousands of Central Processing Unit (CPU) cores or Graphics Processing Unit (GPU) instances, partition your dataset across them, process everything in a coordinated window, write results, then shut down to zero. Consider a recommendation system generating candidates for 500 million users. You partition users into 10,000 chunks of 50,000 each. Each worker loads the model once, streams through its chunk, and writes predictions to storage. The entire job runs for 90 minutes using 10,000 cores, then terminates. Total cost: compute time only, no idle capacity.
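A minimal sketch of one worker's loop under these assumptions: `load_model`, `fetch_features`, and `write_predictions` are hypothetical stand-ins for your model loader, feature store, and storage writer, and micro-batching at 1024 is an illustrative choice:

```python
from itertools import islice

def chunked(ids, size):
    """Yield fixed-size micro-batches from an iterable of user IDs."""
    it = iter(ids)
    while batch := list(islice(it, size)):
        yield batch

def run_partition(partition_id, user_ids):
    model = load_model()  # hypothetical: weights loaded once per worker, amortized over the chunk
    rows = []
    for chunk in chunked(user_ids, 1024):  # micro-batches keep inference vectorized
        features = [fetch_features(uid) for uid in chunk]  # hypothetical feature lookup
        scores = model.predict(features)
        rows.extend(zip(chunk, scores))
    write_predictions(partition_id, rows)  # hypothetical bulk write; the worker then exits
```

The point of the structure is the amortization: model load happens once per worker, not once per prediction, and the worker holds no state after its final write, so the fleet can scale to zero.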
When Batch Wins
Batch is ideal when prediction utility decays slowly. Churn prediction for next month does not need to update every second. Weekly email campaign targeting can use predictions computed overnight. Content moderation backfills can run on 24-hour windows. The key question: does freshness matter enough to justify the 5x to 20x higher cost of always-on online serving?
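To make that multiplier concrete, here is a back-of-envelope comparison. Every number below is an illustrative assumption, not a measured figure: batch pays only for active compute, while an online fleet is provisioned for peak traffic and billed around the clock.

```python
# All numbers are hypothetical assumptions for illustration.
PRICE_PER_CORE_HOUR = 0.04       # assumed cloud price, USD
PREDICTIONS_PER_CORE_SEC = 200   # assumed model throughput

# Batch: 500M predictions, pay only while the job runs.
batch_core_hours = 500e6 / PREDICTIONS_PER_CORE_SEC / 3600
batch_cost = batch_core_hours * PRICE_PER_CORE_HOUR   # ~= $28 per run

# Online: fleet sized for peak QPS with 2x headroom (assumed), billed 24/7.
peak_qps = 20_000
online_cores = 2 * peak_qps / PREDICTIONS_PER_CORE_SEC
online_cost_per_day = online_cores * 24 * PRICE_PER_CORE_HOUR  # ~= $192 per day

print(f"batch ${batch_cost:.0f}/run vs online ${online_cost_per_day:.0f}/day")
```

Under these assumptions a daily batch run costs roughly 7x less than keeping an equivalent online fleet warm, which is squarely inside the 5x to 20x range; the exact ratio depends on how bursty your traffic is and how much headroom the online fleet carries.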
Production Pattern: Prediction Store
The standard architecture writes predictions to a key-value store indexed by entity. Schema: user_id → {prediction_scores, model_version, timestamp, ttl}. Applications read predictions by key lookup, never recomputing. This decouples inference cost from serving queries per second (QPS).
For example, YouTube might materialize the top 1,000 candidate video IDs per user daily. When you open the app, the service reads your precomputed list (one Redis lookup, under 5 ms), applies online filters, and returns results. Zero inference compute on the hot path.
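A minimal sketch of both sides of this pattern, assuming Redis as the store; the pred:{user_id} key convention and the seven-day TTL are illustrative choices, not fixed parts of the pattern:

```python
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis; point at your cluster

def write_prediction(user_id: int, scores: list[float], model_version: str) -> None:
    """Batch job writes one record per entity; key expiry enforces the TTL."""
    record = {
        "prediction_scores": scores,
        "model_version": model_version,
        "timestamp": time.time(),
    }
    r.set(f"pred:{user_id}", json.dumps(record), ex=7 * 24 * 3600)

def read_candidates(user_id: int) -> list[float]:
    """Hot path: one key lookup, zero inference. Returns [] on a cache miss."""
    raw = r.get(f"pred:{user_id}")
    return json.loads(raw)["prediction_scores"] if raw else []
```

The TTL doubles as a safety net: if the batch pipeline stalls, stale predictions age out rather than serving indefinitely, and the miss path can fall back to a default.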
The Straggler Problem
A few partitions often dominate compute time due to data skew. Maybe 95% of users finish in 60 minutes, but the 5% with massive histories take 3 hours. Your job completion time is set by the slowest partition. Mitigation: dynamic repartitioning, speculative execution for slow tasks, or capping per-entity work, as in the sketch below.
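One way to combine two of these mitigations is to partition by estimated work rather than by user count. A minimal sketch, assuming you can cheaply estimate per-user cost (e.g. history length) and that per_user_cap is a hypothetical policy value:

```python
import heapq

def balanced_partitions(user_costs: dict[int, int], n_parts: int,
                        per_user_cap: int = 10_000) -> list[list[int]]:
    """Greedy longest-processing-time bin-packing: assign the heaviest users
    first, each to the currently lightest partition. Capping per-entity cost
    keeps a single runaway history from dominating any partition."""
    heap = [(0, i) for i in range(n_parts)]  # (total_cost, partition_index)
    parts: list[list[int]] = [[] for _ in range(n_parts)]
    for uid, cost in sorted(user_costs.items(), key=lambda kv: -kv[1]):
        cost = min(cost, per_user_cap)       # cap per-entity work (assumed policy)
        total, idx = heapq.heappop(heap)
        parts[idx].append(uid)
        heapq.heappush(heap, (total + cost, idx))
    return parts
```

This does not eliminate skew, but it converts a 3-hour tail into partitions that finish within minutes of each other, provided the cost estimates are roughly proportional to real compute time.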
Versioning and Atomic Swaps
Never mutate live predictions in place. Write each run to a fresh versioned namespace such as predictions_v123, validate, then atomically flip consumers to the new version. If validation fails, the previous version keeps serving and the bad batch is discarded.
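A minimal sketch of the flip, assuming the same Redis store as above and that consumers resolve the live version through a single pointer key; the pred_version key and validate() check are hypothetical conventions, not part of any standard API:

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed store, as above

def publish_version(new_version: str) -> None:
    """Flip all consumers to a new batch output in one atomic step."""
    if not validate(new_version):  # hypothetical checks: row counts, score distributions
        raise RuntimeError(f"refusing to flip to unvalidated batch {new_version}")
    # A single SET is atomic: readers see the old version or the new one, never a mix.
    r.set("pred_version", new_version)

# Reader side: version = r.get("pred_version"), then key into f"pred:{version}:{user_id}".
```

Keeping the previous version's keys around until their TTL expires also gives you an instant rollback: flipping the pointer back is the same one-key write.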