
Two Stage Retrieval: ANN Candidate Generation + Re-Ranking

Two-stage retrieval decouples fast candidate generation from expensive scoring. The first stage uses an ANN index to retrieve k candidates (typically 200 to 1000) at high recall (95 to 99 percent) within a small latency budget of 5 to 20 milliseconds. The second stage applies a full ranking model with rich features, including non-vector signals such as user history, item popularity, and contextual attributes. Because the re-ranker scores only the k candidates, even a 40 millisecond model fits within a 100 millisecond end-to-end Service Level Agreement (SLA).

The pattern lets model complexity scale independently of corpus size: you can retrieve from 1 billion items in 10 milliseconds using FAISS, then score 500 candidates with a deep neural network that would take seconds if applied to every item. This architecture is ubiquitous in production. Google Search uses ANN for document retrieval followed by learning-to-rank models. LinkedIn feed ranking retrieves posts via embedding similarity, then applies a gradient-boosted decision tree re-ranker. YouTube recommendations retrieve videos with ANN on user and video embeddings, then re-rank with watch-time prediction models.

The key metric is stage-one recall. If the ANN index misses relevant items, the re-ranker cannot recover them. Operators measure recall by running exact search on a small canary subset (for example, 1 million items on a single GPU) and comparing the ANN results. A 2 percent recall drop can degrade final relevance metrics such as Click-Through Rate (CTR) or conversion by 5 to 10 percent.

Tuning involves balancing k against ANN parameters. Increasing k from 200 to 1000 improves re-ranker coverage but adds latency; increasing efSearch (HNSW) or nprobe (IVF) improves ANN recall but slows retrieval. The system must meet both recall targets and latency budgets at p99.
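A minimal NumPy sketch of the two-stage pattern, on toy data. Stage one is an IVF-style coarse search (cluster the corpus, probe only the closest lists); stage two re-ranks the candidates with a richer score that mixes similarity with a non-vector feature. The cluster count, nprobe, and score weights here are illustrative assumptions; in production, stage one would be a real FAISS or HNSW index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus of normalized item embeddings (stand-in for a billion-item index).
corpus = rng.standard_normal((10_000, 64)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Stage 1 setup: IVF-style coarse quantizer — assign each item to its
# nearest centroid so queries can probe a few clusters instead of everything.
n_clusters, nprobe = 100, 8
centroids = corpus[rng.choice(len(corpus), n_clusters, replace=False)]
assignments = np.argmax(corpus @ centroids.T, axis=1)

def ann_candidates(query, k=500):
    # Probe only the nprobe closest clusters, then take the top-k by similarity.
    probe = np.argsort(centroids @ query)[-nprobe:]
    ids = np.flatnonzero(np.isin(assignments, probe))
    scores = corpus[ids] @ query
    return ids[np.argsort(scores)[-k:]]

def rerank(query, candidate_ids, popularity):
    # Stage 2: richer score over only k candidates — here a hypothetical
    # blend of embedding similarity and an item-popularity feature.
    sim = corpus[candidate_ids] @ query
    score = 0.8 * sim + 0.2 * popularity[candidate_ids]
    return candidate_ids[np.argsort(score)[::-1]]

query = rng.standard_normal(64).astype(np.float32)
query /= np.linalg.norm(query)
popularity = rng.random(len(corpus)).astype(np.float32)

cands = ann_candidates(query, k=500)
ranked = rerank(query, cands, popularity)
print(ranked[:10])  # final top-10 after re-ranking
```

The re-ranker touches only `len(cands)` items, so its per-item cost is independent of corpus size; that is the property that lets a heavy model fit the latency budget.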
💡 Key Takeaways
Stage one ANN retrieves 200 to 1000 candidates at 95 to 99 percent recall in 5 to 20 milliseconds; stage two re-ranks with a full model in 20 to 50 milliseconds
Recall loss in stage one cannot be recovered: 2 percent ANN recall drop can reduce final CTR or conversion by 5 to 10 percent in production metrics
Candidate count k trades coverage for latency: k=200 is fast but may miss relevant items, k=1000 gives re-ranker more options but adds 10 to 20 milliseconds
Re-ranker can use non-vector features like user history, item popularity, recency, and contextual signals that would be too expensive for billion-item retrieval
Monitor stage one recall continuously by comparing ANN results to exact search on a 0.1 percent canary subset, set alerts when recall drops below threshold
Used universally in production: Google Search document retrieval, LinkedIn feed ranking, YouTube video recommendations, Spotify playlist generation
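The recall monitoring described above can be sketched in a few lines: run exact search on a canary subset as ground truth, compute recall@k for the ANN results, and alert below a threshold. The query data and the simulated ANN misses here are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Canary subset of normalized embeddings (stand-in for a 1M-item GPU subset).
canary = rng.standard_normal((1_000, 32)).astype(np.float32)
canary /= np.linalg.norm(canary, axis=1, keepdims=True)

def exact_topk(query, k):
    # Ground truth: brute-force similarity search over the canary subset.
    return set(np.argsort(canary @ query)[-k:].tolist())

def recall_at_k(ann_ids, query, k):
    # Fraction of the true top-k that the ANN stage actually returned.
    truth = exact_topk(query, k)
    return len(truth & set(ann_ids)) / k

query = rng.standard_normal(32).astype(np.float32)
query /= np.linalg.norm(query)

# Simulate an ANN result that misses 10 of the 500 true neighbors.
truth = sorted(exact_topk(query, 500))
ann_ids = truth[:490] + list(range(100_000, 100_010))  # 10 wrong candidates

r = recall_at_k(ann_ids, query, 500)
print(f"recall@500 = {r:.3f}")  # 490/500 = 0.980
if r < 0.96:
    print("ALERT: stage-one recall below threshold")
```

In production this check would run on a sampled slice of live queries (e.g. 0.1 percent), tracking recall@500 and recall@1000 over time rather than a single query.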
📌 Examples
YouTube recommendations: ANN retrieves 800 video candidates from 100 million corpus in 12 milliseconds at 98 percent recall, neural network re-ranker scores with watch time prediction in 35 milliseconds
LinkedIn feed: embedding similarity retrieves 500 post candidates in 8 milliseconds, gradient boosted tree re-ranker uses 50 features including connection strength and engagement history, scores in 25 milliseconds
Monitoring setup: sample 0.1 percent of queries, run exact search on 1 million item subset on single A100 GPU, compare to ANN results, track recall@500 and recall@1000 metrics, alert if recall drops below 96 percent