Vector search, embeddings & RAG, explained interactively

Why vector search?

Keyword search matches the words you typed. Ask for "how do I remove rows" and it will miss a document that says "deleting records" - same meaning, different words. Vector search matches meaning. A model turns each piece of text into an embedding: a list of numbers - a point in a high-dimensional space - positioned so that things with similar meaning sit close together.

Once meaning is geometry, "find the most relevant documents" becomes "find the nearest points to this query point". That single shift powers semantic search, recommendations, deduplication, classification, and the retrieval half of RAG.

Embeddings & similarity

An embedding model maps text to a fixed-length vector - often hundreds to a few thousand dimensions. The 2026 landscape is rich: proprietary leaders such as Voyage 4 (voyage-4-large), OpenAI text-embedding-3-large, Cohere Embed v4, and Google Gemini Embedding 2 (now natively multimodal), alongside strong open models like BGE-M3, Nomic Embed v2, and Qwen3 Embedding. "Closeness" needs a metric, and the choice matters:

Cosine similarity - the angle between vectors, ignoring length. The usual default for text.
Dot product - rewards both alignment and magnitude. On normalised vectors, cosine and dot product are identical.
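Euclidean (L2) distance - straight-line distance, so it is sensitive to magnitude. On normalised vectors it ranks results exactly as cosine does, because the squared distance equals 2 - 2 x cosine similarity.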
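To make the difference between the three metrics concrete, here is a minimal NumPy sketch. The three toy vectors are made up for illustration (they are not real embeddings): one points the same way as the query, the other is a long "far-out" point in a different direction.

```python
import numpy as np

def cosine(a, b):
    # Angle only: divide out both lengths
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot(a, b):
    # Alignment and magnitude together
    return float(a @ b)

def l2(a, b):
    # Straight-line distance (smaller means closer)
    return float(np.linalg.norm(a - b))

def unit(v):
    return v / np.linalg.norm(v)

query = np.array([1.0, 1.0])
aligned = np.array([2.0, 2.0])    # same direction as the query
far_out = np.array([10.0, 4.0])   # different direction, much longer

for name, v in [("aligned", aligned), ("far_out", far_out)]:
    print(name, round(cosine(query, v), 3), dot(query, v), round(l2(query, v), 3))

# After normalising, cosine and dot product agree, and L2 is a monotone
# function of them: ||a - b||^2 = 2 - 2 * (a . b)
uq, uf = unit(query), unit(far_out)
print(round(cosine(uq, uf), 6), round(dot(uq, uf), 6), round(l2(uq, uf) ** 2, 6))
```

Cosine and L2 prefer the aligned vector, while the raw dot product prefers the far-out one purely because it is longer. That is the reason the ranking shifts when you switch metric on unnormalised vectors.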

Two ideas reshaped 2026 embeddings. Matryoshka representation learning packs the most important signal into the leading dimensions, so you can truncate a 2048-d vector down to 256-d and keep most of the quality - a cheap memory and latency lever now common across Voyage 4, OpenAI, Cohere, and the open models. Multi-vector / late interaction swaps a single vector per chunk for many: ColBERT keeps one vector per token and scores with MaxSim, and ColPali/ColQwen do the same over document page images. It improves ranking on hard queries, and because it costs more to store and score, it usually runs as a re-scoring step over a cheaper single-vector shortlist.
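Both ideas fit in a few lines of NumPy. This is a sketch under stated assumptions: the token vectors are hand-written toys, `truncate` and `maxsim` are illustrative helper names rather than any library's API, and truncation only preserves quality when the model was trained with a Matryoshka objective.

```python
import numpy as np

def truncate(vectors, dim):
    """Matryoshka-style truncation: keep the leading dims, then re-normalise."""
    cut = vectors[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for each query token take its best-matching
    document token, then sum those maxima. Inputs are (tokens, dim) arrays
    of unit vectors."""
    sims = query_tokens @ doc_tokens.T        # (query tokens, doc tokens)
    return float(sims.max(axis=1).sum())

# Toy per-token vectors: two query tokens, two candidate documents
query_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])   # covers both query tokens
doc_b = np.array([[0.6, 0.8]])                            # partial match only

print(maxsim(query_tokens, doc_a), maxsim(query_tokens, doc_b))

# Truncating 8-d vectors to 4-d keeps them unit length after re-normalising
full = np.random.default_rng(0).normal(size=(3, 8))
small = truncate(full, 4)
print(small.shape, np.linalg.norm(small, axis=1).round(6))
```

The document that covers every query token scores higher under MaxSim, which is the behaviour a single pooled vector can blur away.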

Pick a word below to make it the query, and switch the metric to watch the ranking change - especially around the far-out point.

Embedding space

Each dot is a word placed by meaning. Real embeddings have hundreds to thousands of dimensions - this is a 2D stand-in so you can see it. Click any word to make it the query.

Finding neighbours fast: ANN

Exact nearest-neighbour search compares the query to every vector - fine for thousands, hopeless for hundreds of millions. Approximate nearest neighbour (ANN) gives up a little recall for an enormous speed-up by being clever about which vectors it even looks at.

The dominant approach, HNSW, builds a layered proximity graph and walks it greedily towards the query, hopping from neighbour to neighbour. A search budget (efSearch) controls how many candidates it explores. Below: switch between exact and approximate, drag the budget, and watch recall and the number of vectors scanned move in opposite directions.
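The graph walk can be sketched as a best-first search with a capped candidate list. This is a single-layer simplification, not HNSW itself: real HNSW adds the upper layers, bidirectional links, and neighbour-pruning heuristics, and the brute-force graph builder here is only viable for toy data. The names `build_knn_graph` and `graph_search` are made up for this sketch, and `ef` plays the role of efSearch.

```python
import heapq
import numpy as np

def build_knn_graph(vectors, m=8):
    """Link every vector to its m nearest neighbours (brute force, toy sizes only)."""
    dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # a node is not its own neighbour
    return np.argsort(dists, axis=1)[:, :m]

def graph_search(vectors, graph, query, k=5, ef=16, entry=0):
    """Best-first walk from `entry`; `ef` caps how many candidates are kept."""
    def dist(i):
        return float(np.linalg.norm(vectors[i] - query))

    visited = {entry}
    d0 = dist(entry)
    candidates = [(d0, entry)]    # min-heap: closest unexplored node first
    best = [(-d0, entry)]         # max-heap via negation: the ef closest so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                 # the nearest candidate is worse than our worst keeper
        for nb in graph[node]:
            nb = int(nb)
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # drop the current worst
    top = sorted((-negd, i) for negd, i in best)[:k]
    return [i for _, i in top], len(visited)

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 8))
query = rng.normal(size=8)
graph = build_knn_graph(vectors)

# Ground truth from an exact scan of all 500 vectors
exact = set(np.argsort(np.linalg.norm(vectors - query, axis=1))[:5].tolist())
for ef in (5, 20, 80):
    found, scanned = graph_search(vectors, graph, query, k=5, ef=ef)
    print(f"ef={ef:3d} scanned={scanned:3d} recall@5={len(exact & set(found)) / 5:.1f}")
```

Raising `ef` lets the walk keep more candidates alive, so it typically scans more vectors and recovers more of the exact top 5.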

Speed is one axis; memory is the other, and at a billion vectors the index itself becomes the cost. Three levers shrink it: quantization - store int8 instead of float32 for roughly 4x less memory, or pack each dimension into a single bit (binary) for about 32x, then re-rank the survivors at full precision; Matryoshka truncation to fewer dimensions; and IVF (cluster-and-probe), which replaces HNSW's per-vector graph links with a small set of centroids and scans only a few partitions per query. Each trades a little accuracy for a lot of headroom.
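The quantization arithmetic is easy to check by hand. The sketch below uses random unit vectors as stand-ins for real embeddings, a single shared scale for int8 (production systems often calibrate per dimension), and a hypothetical helper, `binary_search_then_rerank`, that shortlists by Hamming distance on the packed bits and then re-ranks at full precision.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 256)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)    # unit vectors

# int8 scalar quantization: one byte per dimension instead of four
scale = float(np.abs(vectors).max()) / 127.0
int8_vecs = np.round(vectors / scale).astype(np.int8)

# Binary quantization: keep only the sign, packed 8 dimensions per byte
binary_vecs = np.packbits(vectors > 0, axis=1)               # shape (1000, 32)

print(vectors.nbytes // int8_vecs.nbytes)     # -> 4
print(vectors.nbytes // binary_vecs.nbytes)   # -> 32

def binary_search_then_rerank(query, k=5, shortlist=50):
    """Cheap Hamming shortlist on the bits, then exact re-rank in float32."""
    q_bits = np.packbits(query > 0)
    hamming = np.unpackbits(binary_vecs ^ q_bits, axis=1).sum(axis=1)
    candidates = np.argsort(hamming)[:shortlist]
    exact_scores = vectors[candidates] @ query               # full precision
    return candidates[np.argsort(-exact_scores)[:k]]

print(binary_search_then_rerank(vectors[42]))
```

The shortlist size is the knob: a larger one costs more full-precision reads but recovers more of the accuracy the bits threw away.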

Approximate nearest neighbour

Click the map to move the query.

Bigger budget = more vectors scanned and higher recall, but slower. The memory levers trade differently: int8 quantization is about 4x smaller, binary about 32x, and Matryoshka truncation drops dimensions - each recovering accuracy with a full-precision re-rank of the shortlist.

Retrieval-augmented generation

RAG grounds a language model in your data instead of trusting its memory. The classic single-pass pipeline is simple: chunk your documents, embed and index them, retrieve the top-k most similar chunks for a question, optionally rerank them, stuff them into the prompt, and let the model generate a grounded, citable answer.

The catch: the answer can only be as good as the retrieval. Try it - change top-k and toggle the reranker, and watch the answer flip between grounded and useless.

RAG pipeline

Each query has one passage that truly answers it (★ gold) and a distractor that merely shares keywords. Retrieval scores are illustrative stand-ins for embedding similarity.

One retrieval pass is fragile: if the first search misses, the answer is doomed. The 2026 shift is agentic RAG - an LLM agent treats retrieval as a loop. It decomposes the question, retrieves, grades whether the evidence is sufficient, and if not, rewrites or splits the query and retrieves again, stopping when it is confident enough to answer or its step budget runs out. For multi-hop or compositional questions, GraphRAG (Microsoft) walks a knowledge graph built from the corpus instead of relying on flat top-k. Both cost more tokens and latency than a single pass, so they are best reached selectively via complexity-aware routing: simple questions take the one-shot path, and only the hard ones enter the loop.

Step through the loop below, then flip to one-shot to see where a single pass gives up.

Agentic RAG loop

Multi-hop question

For the similarity metric the docs recommend as the default, is the score affected by vector length?

An illustrative agentic loop: the agent retrieves, grades whether the evidence is sufficient, and re-retrieves with a refined query until it can answer - versus a single-pass pipeline that answers from one shot. The corpus and the grader's verdicts are authored stand-ins for a real retriever and grader model.

Making retrieval actually good

Most "the LLM hallucinated" problems are really retrieval problems. The levers that matter:

Chunking - too large and the relevant sentence is diluted; too small and it loses context. Recursive or semantic chunking with a little overlap usually beats fixed sizes. Contextual retrieval (Anthropic) prepends a short document-level blurb to each chunk before embedding; Anthropic reports 35-49% fewer retrieval failures. Late chunking (Jina) embeds the whole document first, then pools token embeddings into chunks so cross-chunk context survives.
Hybrid search - combine dense vectors with classic BM25 keyword scoring (fused with reciprocal rank fusion). Vectors catch meaning; keywords catch exact terms, names, and codes.
Reranking - a cross-encoder reads the query and each candidate together and re-scores them; ColBERT-style late interaction and hosted rerankers (Cohere, Voyage) do the same job. Slower per item, so you run it only on the shortlist - and it sharply improves precision.
GraphRAG - for multi-hop or compositional questions, retrieve over a knowledge graph built from the corpus rather than flat top-k, then reason across the linked entities.
Metadata filtering - constrain by tenant, date, or document type before or during the vector search.

And measure it. Retrieval quality: recall@k, MRR, nDCG. Answer quality: faithfulness (is it grounded in the context?) and answer relevance, with tools like RAGAS, Phoenix, and DeepEval. Watch for classic failure modes - lost in the middle (the model ignores chunks buried in a long context), context rot (accuracy drifts down as the context window fills, well before the token limit), chunk-boundary loss, a stale index, and query-document mismatch.
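The three retrieval metrics are short enough to write out. A minimal sketch in plain Python, with made-up document ids; MRR is simply the mean of `reciprocal_rank` over a set of queries, and the nDCG here uses the common linear-gain, log2-discount form.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Share of the relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant hit, or 0 if there is none."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """DCG of this ranking divided by the DCG of the ideal ordering."""
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

ranked = ["d3", "d1", "d7", "d2"]     # what the retriever returned, best first
relevant = {"d1", "d2"}               # hand-labelled relevant ids
gains = {"d1": 1, "d2": 1}            # graded relevance (binary here)

print(recall_at_k(ranked, relevant, k=3))     # -> 0.5
print(reciprocal_rank(ranked, relevant))      # -> 0.5
print(round(ndcg_at_k(ranked, gains, k=4), 3))
```

Recall@k tells you whether the right chunk was retrieved at all; the reciprocal rank and nDCG also reward putting it near the top, which matters once a reranker or a small top-k is involved.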

Chunking is the highest-leverage and most underrated knob. Pick a strategy and watch how the same document is split - and whether the chunk you retrieve still carries enough context to answer.

Chunking strategy explorer

Query: Does Iceberg v3 support row-level deletes?

One short document, one query, four strategies. The split points, the retrieved chunk, and the verdicts are authored stand-ins to show how context survives (or does not) - real chunkers use token counts, embeddings, and overlap.
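For a feel of what overlap buys you, here is a small sentence-packing chunker. It is a sketch under simplifying assumptions: it counts words rather than tokens, splits sentences with a naive regex, and the helper names `chunk_sentences` and `add_context` are invented for this example. `add_context` only mimics the shape of contextual retrieval by prepending a fixed blurb; the real technique generates the context with an LLM.

```python
import re

def chunk_sentences(text, max_words=40, overlap=1):
    """Pack whole sentences into chunks of at most ~max_words words,
    repeating the last `overlap` sentences at the start of the next chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        size = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and size > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

def add_context(chunks, blurb):
    """Prefix every chunk with a short document-level description."""
    return [f"{blurb}\n{chunk}" for chunk in chunks]

text = "Alpha one two. Beta three four. Gamma five six. Delta seven eight."
chunks = chunk_sentences(text, max_words=6, overlap=1)
for chunk in add_context(chunks, "From: toy document"):
    print(repr(chunk))
```

With one sentence of overlap, each chunk opens with the sentence that closed the previous one, so a fact that straddles a boundary is still retrievable in one piece.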

Dense vectors and keyword search fail in opposite ways. Hybrid search runs both and fuses them with reciprocal rank fusion; a reranker then sharpens the top. See it on one query below.

Hybrid search + reranking

Query: How do I fix error E-4012 when deleting rows?

Two gold passages answer this: one names the exact code (E-4012), one explains the fix in plain words. Dense, BM25, and relevance scores are authored stand-ins - watch which ranking finds which gold passage, and how fusion and reranking combine them.
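Reciprocal rank fusion itself is only a few lines: each document earns 1 / (k + rank) from every list it appears in, and the sums decide the fused order. In this sketch the passage ids and both rankings are invented for illustration, and k = 60 is the conventional constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings; only ranks matter, never raw scores."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings for the same query
dense = ["fix_guide", "faq", "e4012_ref", "intro"]    # meaning-based
bm25 = ["e4012_ref", "fix_guide", "changelog"]        # exact-term based

fused = reciprocal_rank_fusion([dense, bm25])
print(fused)
```

Because RRF ignores the raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales. Passages that both retrievers rank well rise to the top, and a reranker can then sort out the order within that shortlist.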

Tools & the 2026 picture

You do not always need a dedicated vector database. pgvector (with pgvectorscale for larger workloads) turns Postgres into a capable vector store and is often enough; Elasticsearch/OpenSearch, ClickHouse, and DuckDB's VSS extension add vector search to engines you may already run. Dedicated stores - Pinecone, Weaviate, Qdrant (with native multi-vector / late-interaction support), Milvus, Chroma, LanceDB, and object-storage newcomers like Turbopuffer - earn their place at large scale, with heavy filtering, or when you want built-in hybrid search and reranking. The honest rule: start with what is already in your stack; reach for a specialised store when scale or features force it.

The 2026 picture: retrieval is becoming a loop, not a step - agentic RAG decomposes and re-retrieves, and MCP (the Model Context Protocol) standardises how agents reach tools and data. For small, stable corpora, cache-augmented generation (CAG) preloads everything into the context and skips retrieval altogether. And the recurring question - do million-token context windows kill RAG? The consensus is still no: they are complementary, and context rot means stuffing the window is not free. Long context helps within a single document, but RAG still wins on scale, cost, freshness, access control, and citations. You retrieve then reason.

-- pgvector: vector search inside Postgres
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE docs (
    id        BIGSERIAL PRIMARY KEY,
    content   TEXT,
    embedding VECTOR(1536)            -- e.g. text-embedding-3-small
);

-- HNSW index for fast approximate search (cosine distance)
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Trade recall vs speed at query time
SET hnsw.ef_search = 40;

-- Top-5 nearest by cosine distance  (<=> is the cosine-distance operator)
SELECT id, content
FROM docs
ORDER BY embedding <=> :query_embedding
LIMIT 5;

from sentence_transformers import SentenceTransformer
import numpy as np

# 2026 open model with Matryoshka dimensions - truncate to save memory
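# This checkpoint ships custom model code, hence trust_remote_code=True
model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe",
                            trust_remote_code=True,
                            truncate_dim=256)             # 768 -> 256 dims
docs = ["deleting rows in Iceberg ...", "HNSW graph search ...", "..."]
# Per the model card, documents and queries use different prompts
emb = model.encode(docs, prompt_name="passage",
                   normalize_embeddings=True)             # unit vectors

q = model.encode(["how do deletion vectors work?"], prompt_name="query",
                 normalize_embeddings=True)[0]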

# on normalised vectors, cosine == dot product
scores = emb @ q
for i in np.argsort(-scores)[:5]:
    print(round(float(scores[i]), 3), docs[i])

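# Minimal RAG: retrieve, (rerank), then ground the answer
# vector_store, embed, cross_encoder, and llm are placeholders for your own components
def answer(question, k=5, rerank=True, shortlist=20):
    # Over-fetch when reranking: reordering only k hits cannot surface a chunk ranked below k
    n = shortlist if rerank else k
    hits = vector_store.search(embed(question), k=n)        # bi-encoder recall
    if rerank:
        hits = cross_encoder.rerank(question, hits)[:k]    # precision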

    context = "\n\n".join(h.text for h in hits)
    prompt = (
        "Answer using only the context, and cite sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm.complete(prompt)

# Agentic RAG: retrieve -> grade -> re-retrieve until sufficient
def agentic_answer(question, max_steps=3):
    query, calls, context = question, 0, []
    for _ in range(max_steps):
        hits = vector_store.search(embed(query), k=5)   # bi-encoder recall
        calls += 1
        context += hits
        verdict = grade(question, context)              # an LLM "grader"
        if verdict.sufficient:
            break
        query = verdict.refined_query                   # decompose / rewrite

    prompt = build_prompt(question, context)            # cite sources
    return llm.complete(prompt), calls

Key takeaways

Embeddings turn meaning into geometry: similar text becomes nearby vectors, so search becomes "find the nearest points".
The similarity metric matters - cosine for direction, dot product when magnitude counts; they coincide on normalised vectors.
Exact search does not scale; ANN (HNSW, IVF, quantization) trades a little recall for a large speed-up, tuned by a search budget.
RAG is only as good as its retrieval - top-k, hybrid search, and reranking decide whether the answer is grounded or guessed.
Start with pgvector or what you already run; reach for a dedicated vector store at scale. Long context complements RAG, it does not replace it.
