Embeddings & Similarity Search: Embedding Generation (BERT, Sentence-BERT, Graph Embeddings)

What is Embedding Generation and Why It Matters

Definition
Embedding generation converts raw inputs (text, images, users, items) into fixed-length numerical vectors where similar things are close together. A sentence becomes 768 numbers; similar sentences have similar numbers.
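The "similar things are close" property can be sketched with toy 4-dimensional vectors and cosine similarity (real models output hundreds of dimensions; the vectors below are made up for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings (a real encoder would output e.g. 768 numbers per sentence).
cheap_flights = np.array([0.9, 0.1, 0.8, 0.2])
plane_tickets = np.array([0.8, 0.2, 0.9, 0.1])
cat_video     = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(cheap_flights, plane_tickets))  # high, near 1.0
print(cosine_similarity(cheap_flights, cat_video))      # much lower
```

Distance between vectors stands in for semantic distance: the two paraphrases score far higher with each other than with the unrelated vector.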

WHY EMBEDDINGS EXIST

Raw inputs are hard to compare. How similar are "cheap flights to Paris" and "affordable plane tickets to France"? String matching fails. Embeddings solve this: both sentences map to nearby vectors, and vector distance measures semantic similarity.

The key property: similar inputs produce similar vectors. If you train embeddings on search clicks, queries that lead to the same results will cluster together even with different words.

TYPES OF EMBEDDINGS

Text embeddings: Neural networks (BERT, Sentence-BERT) encode sentences into 384-768 dim vectors. Inference: 10-50ms on GPU.
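To make "fixed-length vector" concrete, here is a deliberately naive stand-in encoder: it hashes tokens into one of 384 buckets and L2-normalizes. A real model like Sentence-BERT learns its dimensions and captures meaning; this toy version only captures word overlap, but it shows the shape contract every encoder obeys:

```python
import numpy as np

DIM = 384  # a common Sentence-BERT output size

def toy_encode(sentence: str) -> np.ndarray:
    """Toy encoder: hash each token into one of DIM buckets, then
    L2-normalize. Fixed-length output like a real model, but it only
    reflects word overlap, not semantics."""
    vec = np.zeros(DIM)
    for token in sentence.lower().split():
        vec[hash(token) % DIM] += 1.0
    return vec / np.linalg.norm(vec)

v = toy_encode("cheap flights to paris")
print(v.shape)  # (384,) -- every sentence, long or short, maps to DIM numbers
```

Whatever the input length, the output is always DIM numbers, which is what lets downstream similarity search treat all items uniformly.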

Image embeddings: CNNs or Vision Transformers encode images into 512-2048 dim vectors. Used for visual similarity search.

Graph embeddings: Encode user-item interactions into vectors. Capture collaborative signals (users who click similar items have similar embeddings).
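One simple way to turn interactions into embeddings (chosen here for illustration; production systems often use methods like node2vec or GraphSAGE instead) is factorizing the user-item interaction matrix with a truncated SVD:

```python
import numpy as np

# Rows = users, columns = items; 1 = user clicked the item.
interactions = np.array([
    [1, 1, 0, 0],   # user 0 clicks items 0 and 1
    [1, 1, 0, 0],   # user 1 clicks the same items
    [0, 0, 1, 1],   # user 2 clicks different items
], dtype=float)

# Truncated SVD: keep k = 2 latent dimensions.
U, S, Vt = np.linalg.svd(interactions, full_matrices=False)
k = 2
user_emb = U[:, :k] * S[:k]   # one k-dim embedding per user
item_emb = Vt[:k].T * S[:k]   # one k-dim embedding per item

# Users 0 and 1 clicked identical items, so their embeddings coincide.
print(np.allclose(user_emb[0], user_emb[1]))  # True
```

This is the collaborative signal in miniature: users with the same click behavior land at the same point, with no item content involved at all.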

THE EMBEDDING PIPELINE

Training: collect pairs of similar items (e.g., co-clicked or co-purchased products) and train the model to pull their embeddings close together. Inference: encode new items, store the vectors, and use approximate nearest neighbor (ANN) search to find similar ones.
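The inference half of the pipeline can be sketched as brute-force nearest-neighbor search over stored vectors (real systems swap this loop for an ANN index such as FAISS or HNSW once the corpus is large; the corpus here is random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Pretend these are stored embeddings for 1,000 catalog items.
corpus = rng.normal(size=(1000, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # L2-normalize once

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar stored vectors.
    With normalized vectors, dot product equals cosine similarity."""
    query = query / np.linalg.norm(query)
    scores = corpus @ query          # one dot product per stored item
    return np.argsort(-scores)[:k]   # highest similarity first

# "Encode" a new query: here, a stored vector plus a little noise.
query = corpus[42] + 0.01 * rng.normal(size=64)
print(top_k(query))  # item 42 ranks first
```

Brute force is O(N) per query, which is fine for thousands of items; ANN indexes trade a little recall for sublinear search at millions of items.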

💡 Key Insight: Embeddings are only as good as your similarity definition. Click data produces click-similarity embeddings. Purchase data produces purchase-similarity embeddings. Choose training signal carefully.
💡 Key Takeaways
- Embeddings map inputs to vectors where similar things are close
- Text embeddings: 384-768 dims, 10-50ms inference on GPU
- Training signal defines similarity: clicks vs. purchases produce different embeddings
📌 Interview Tips
1. Explain why embeddings beat keyword matching: semantic similarity captures meaning, not just word overlap.
2. Describe how the training signal affects embedding quality: click embeddings differ from purchase embeddings.