
Production Implementation: Architecture and Monitoring

Production two-tower systems require careful orchestration of offline training pipelines and online serving infrastructure. The typical flow starts with a batch job that generates training examples from user interaction logs with strict temporal cutoffs to prevent leakage: use interactions before time T to predict interactions after T, ensuring at least a 24-hour gap. Training runs on GPU clusters for 12 to 48 hours depending on scale; Google trains YouTube retrieval models on billions of examples. After training, a separate batch job computes embeddings for every item in the catalog by running the item tower, writes the vectors to a distributed file system, then builds ANN indexes sharded by eligibility criteria such as locale or category.

Serving architecture separates user tower inference from vector search. The user tower runs on CPU or GPU depending on model complexity, typically taking 1 to 10ms per forward pass. The resulting embedding is sent to a vector search service that hosts prebuilt ANN indexes in memory. FAISS and ScaNN indexes are loaded from versioned snapshots, and hot-swapping a new index requires cache warming with synthetic queries to avoid p99 latency spikes during the first few minutes. Google preloads indexes and runs shadow traffic for 10 to 30 minutes before routing production queries. Sharding is essential: a single host might serve 10 to 50 million items, and larger catalogs are distributed across dozens of shards with client-side fanout and merging of top-K results.

Monitoring must cover both model quality and serving health. Model quality metrics include offline recall@K and hit rate on temporal holdout sets, embedding drift measured as cosine distance between consecutive model versions, and training-serving skew detected by logging feature distributions at serving time and comparing them to training. Serving health tracks p50, p95, and p99 latency for user tower inference and ANN search separately, ANN recall via periodic brute-force comparison on sampled queries, candidate coverage showing what fraction of the catalog appears in any query's top K, and cache hit rates for precomputed embeddings.

Canary deployments are critical for safe rollouts. New models or indexes are deployed to 1 to 5% of traffic, and automated dashboards compare online metrics like CTR, dwell time, and diversity against the control group. If recall@100 drops by more than 2% or p99 latency increases by more than 10ms, rollback is automatic. Meta and Google run multi-day canaries with a gradual ramp from 1% to 10% to 50% before full rollout, with per-region controls to isolate geographic variance.
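A minimal sketch of the temporal-cutoff step in the offline pipeline, assuming interaction logs arrive as dicts with user_id, item_id, and timestamp fields; the field names and the CUTOFF/GAP values are illustrative, not from any specific production system:

```python
# Sketch: generating training examples with a strict temporal cutoff.
# Field names, CUTOFF, and GAP are illustrative assumptions.
from datetime import datetime, timedelta

CUTOFF = datetime(2024, 6, 1)   # time T: user history comes from before this point
GAP = timedelta(hours=24)       # enforce at least a 24-hour gap before labels

def build_examples(interactions):
    """interactions: iterable of dicts with user_id, item_id, timestamp (datetime)."""
    history, labels = {}, {}
    for ev in interactions:
        if ev["timestamp"] < CUTOFF:
            history.setdefault(ev["user_id"], []).append(ev["item_id"])
        elif ev["timestamp"] >= CUTOFF + GAP:
            labels.setdefault(ev["user_id"], []).append(ev["item_id"])
    # One example per (user, post-gap positive), with pre-cutoff history as input.
    return [
        {"user_history": history[u], "positive_item": item}
        for u, items in labels.items() if u in history
        for item in items
    ]
```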
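The index-build step could look roughly like the sketch below, assuming an item_tower callable that maps item features to a fixed-size vector and a catalog of (item_id, features, locale) tuples. A flat inner-product index is used for clarity; a production build would typically use an IVF or HNSW index instead:

```python
# Sketch: batch-computing item embeddings and building per-locale FAISS shards.
# item_tower, the catalog layout, and locale-based sharding are assumptions.
import numpy as np
import faiss

def build_indexes(item_tower, catalog, dim=128, batch_size=4096):
    """catalog: list of (item_id, features, locale); item_tower(features) -> 1-D vector of size dim."""
    shards = {}  # locale -> (item_ids, vectors)
    for start in range(0, len(catalog), batch_size):
        batch = catalog[start:start + batch_size]
        vecs = np.stack([item_tower(feat) for _, feat, _ in batch]).astype("float32")
        faiss.normalize_L2(vecs)  # cosine similarity via inner product on unit vectors
        for (item_id, _, locale), vec in zip(batch, vecs):
            ids, rows = shards.setdefault(locale, ([], []))
            ids.append(item_id)
            rows.append(vec)
    indexes = {}
    for locale, (ids, rows) in shards.items():
        index = faiss.IndexFlatIP(dim)  # exact search; swap in IVF/HNSW at larger scale
        index.add(np.stack(rows))
        indexes[locale] = (ids, index)
    return indexes
```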
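On the serving side, the per-request path (one user-tower forward pass, fan-out to eligible shards, client-side merge of top-K) might be sketched as follows; the shard_indexes structure matches the build sketch above, and the eligibility prefilter is simplified to a list of locales:

```python
# Sketch of the online path: one user-tower forward pass, fan-out to eligible
# shards, and client-side merge of per-shard top-K results.
import heapq
import numpy as np
import faiss

def retrieve(user_tower, user_features, shard_indexes, eligible_locales, k=100):
    """shard_indexes: {locale: (item_ids, faiss_index)} as produced by build_indexes above."""
    query = np.asarray(user_tower(user_features), dtype="float32").reshape(1, -1)
    faiss.normalize_L2(query)
    candidates = []
    for locale in eligible_locales:          # eligibility prefilter before ANN search
        ids, index = shard_indexes[locale]
        scores, idxs = index.search(query, k)
        candidates.extend(
            (float(s), ids[i]) for s, i in zip(scores[0], idxs[0]) if i != -1
        )
    # Merge per-shard results into a single global top-K.
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])
```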
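Two of the model-quality checks mentioned above, embedding drift and training-serving skew, reduce to simple computations over logged arrays. The functions below are illustrative implementations under those assumptions, not a standard library API:

```python
# Sketch of two model-quality checks: embedding drift between model versions and
# training-serving skew for one numeric feature. Both functions are illustrative.
import numpy as np

def embedding_drift(prev_vecs, new_vecs):
    """Mean cosine distance between consecutive models' embeddings.
    Assumes both arrays hold the same items in the same row order."""
    prev = prev_vecs / np.linalg.norm(prev_vecs, axis=1, keepdims=True)
    new = new_vecs / np.linalg.norm(new_vecs, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(prev * new, axis=1)))

def feature_skew(train_values, serving_values, bins=50):
    """Total-variation distance between training and serving histograms of one feature."""
    lo = min(train_values.min(), serving_values.min())
    hi = max(train_values.max(), serving_values.max())
    p, _ = np.histogram(train_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(serving_values, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return float(0.5 * np.abs(p - q).sum())
```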
💡 Key Takeaways
Offline pipeline runs daily or hourly: generate training data with strict temporal cutoffs, train on GPU for 12 to 48 hours, batch compute all item embeddings, build sharded ANN indexes, and deploy versioned snapshots
Online serving separates user tower inference taking 1 to 10ms from ANN search taking 5 to 15ms; prefiltering by eligibility before ANN maintains high recall and reduces wasted compute on ineligible items
Hot swapping new indexes requires cache warming with synthetic queries for 10 to 30 minutes to avoid p99 latency spikes; Google runs shadow traffic before routing production queries to new versions
Monitor offline recall@K on temporal holdout, embedding drift as cosine distance between model versions, training-serving skew via feature distribution comparisons, and online metrics like CTR and dwell time
Track serving health with separate p50/p95/p99 latency for user tower and ANN search, ANN recall via periodic brute force on sampled queries, and candidate coverage showing catalog diversity in top K results
Canary new models to 1 to 5% of traffic with automated rollback if recall@100 drops more than 2% or p99 latency increases more than 10ms; ramp gradually from 1% to 10% to 50% over multiple days (see the sketch after this list)
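The automated rollback rule from the last takeaway can be expressed as a small guard. The 2% recall and 10ms latency thresholds come from the text; the shape of the metric dictionaries is an assumption:

```python
# Sketch of the automated rollback guard; thresholds come from the text above,
# the metric dict shape is an assumption.
def should_rollback(control, canary):
    """control/canary: {'recall_at_100': float, 'p99_latency_ms': float} from the canary dashboards."""
    recall_drop = (control["recall_at_100"] - canary["recall_at_100"]) / control["recall_at_100"]
    latency_increase_ms = canary["p99_latency_ms"] - control["p99_latency_ms"]
    return recall_drop > 0.02 or latency_increase_ms > 10.0

# e.g. should_rollback({'recall_at_100': 0.85, 'p99_latency_ms': 42.0},
#                      {'recall_at_100': 0.82, 'p99_latency_ms': 45.0})  -> True (recall fell ~3.5%)
```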
📌 Examples
YouTube trains retrieval models daily on billions of interaction examples; batch job computes embeddings for 100+ million videos, builds ScaNN indexes sharded by language and region, deploys with 30 minute shadow traffic warmup before production cutover
Spotify runs nightly training and index builds; serves from 20+ shards each holding 10 million tracks with 3 to 5ms ANN latency; monitors ANN recall by running exact search on 0.1% of queries and alerts if recall drops below 90%
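A sketch of the ANN-recall check in the Spotify example: rerun a small sample of queries through exact search and measure the overlap with the ANN results. The FAISS-style search interface, the sampled query array, and the alignment of index ids with item_vecs rows are assumptions here:

```python
# Sketch of the ANN-recall check: compare ANN results against exact search on a
# sampled set of queries. Assumes ann_index was built over item_vecs in the same
# row order, so internal ids line up between the two indexes.
import numpy as np
import faiss

def ann_recall(ann_index, item_vecs, sampled_queries, k=100):
    """item_vecs, sampled_queries: float32 arrays; ann_index: the served ANN index."""
    exact = faiss.IndexFlatIP(item_vecs.shape[1])
    exact.add(item_vecs)
    _, ann_ids = ann_index.search(sampled_queries, k)
    _, exact_ids = exact.search(sampled_queries, k)
    overlaps = [len(set(a) & set(e)) / k for a, e in zip(ann_ids, exact_ids)]
    return float(np.mean(overlaps))

# Alert or block index promotion if ann_recall(...) < 0.90, per the example above.
```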