HNSW Vector Index Explained: How pgvector Achieves Sub-40ms Recall at Scale
Brute-force cosine similarity works fine for 10,000 vectors. At 1 million, it's unusable. HNSW — the algorithm behind pgvector's fast index — gives you approximate nearest neighbors in logarithmic time. Here's how it works and how to tune it.
The problem with exact vector search
A naive vector similarity search scans every stored vector, computes the cosine distance to your query vector, and returns the top-k results. For a 1536-dimensional embedding (OpenAI's text-embedding-3-small), that means a 1536-element dot product against every stored memory, followed by a sort of the entire result set.
At 10k memories, this is ~15ms. At 100k memories, it's ~150ms. At 1M memories, you're looking at 1.5 seconds — far too slow for real-time agent recall. The math is linear: O(n × d), where n is the number of vectors and d is the dimension.
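The linear cost is easy to see in a few lines of numpy — a minimal sketch, with randomly generated vectors standing in for stored embeddings (the function name and data here are illustrative, not pgvector internals):

```python
import numpy as np

def brute_force_top_k(query, vectors, k=10):
    """Exact nearest-neighbor search: O(n * d) distance computations."""
    # Cosine similarity = dot product divided by the norms (assumes non-zero vectors)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    sims = vectors @ query / norms
    # Every stored vector is scored and sorted before we can return top-k
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 1536), dtype=np.float32)
query = rng.standard_normal(1536, dtype=np.float32)
top = brute_force_top_k(query, vectors, k=10)  # touches all 10k rows
```

Doubling n doubles the work; there is no shortcut without an index structure.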
Approximate Nearest Neighbor (ANN) algorithms trade a small amount of recall accuracy for dramatic speed improvements. HNSW (Hierarchical Navigable Small World) is the current state of the art.
What HNSW actually does
HNSW builds a multi-layer graph where each node is a vector. The key insight comes from "small world" network theory: in a well-connected graph, any node can reach any other node in a logarithmic number of hops. HNSW constructs this property deliberately.
The layered graph structure
Imagine a stack of graphs. The top layer is sparse — each node connects to a few long-range neighbors. The bottom layer is dense — each node connects to many close neighbors. Insertion assigns each node to a random maximum layer using an exponential distribution, so most nodes only live in the bottom layer while a few exist across many layers.
During search, the algorithm enters at the top layer and greedily navigates toward the query vector, descending to each lower layer and refining the search. This coarse-to-fine navigation is what gives HNSW its logarithmic complexity.
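The random layer assignment can be sketched directly. This is an illustration of the standard HNSW level rule, level = floor(-ln(u) · mL) with mL = 1/ln(m), not pgvector's actual implementation:

```python
import math
import random

def assign_layer(m=16):
    """Draw a node's maximum layer from an exponential distribution.

    With mL = 1 / ln(m), a fraction (m - 1) / m of nodes land on layer 0,
    and each higher layer holds roughly 1/m of the nodes below it.
    """
    m_l = 1.0 / math.log(m)
    u = 1.0 - random.random()  # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)

random.seed(42)
levels = [assign_layer() for _ in range(100_000)]
# Most nodes live only in the dense bottom layer; a handful span many layers
frac_bottom = sum(1 for lv in levels if lv == 0) / len(levels)
```

With m = 16, about 93.75% of nodes exist only in layer 0, which is exactly the sparse-top, dense-bottom shape the search relies on.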
The two key parameters: m and ef_construction
- m — the number of bidirectional connections each node creates when inserted. Higher m means denser graphs, better recall, but more memory and slower inserts. pgvector defaults to 16.
- ef_construction — the size of the dynamic candidate list during index construction. Higher values improve index quality at the cost of build time. pgvector defaults to 64.
```sql
-- Create an HNSW index on a 1536-dim vector column
CREATE INDEX memories_hnsw_idx ON memories
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- At query time, set ef (search beam width)
SET hnsw.ef_search = 100;
```
The ef_search parameter controls query-time quality. Higher values search more candidates before returning results — better recall, slightly higher latency. The sweet spot for most agent memory workloads is 64–100.
HNSW vs IVFFlat: which to use?
pgvector supports two index types. Here's when to use each:
- HNSW — better recall, no training required, handles small and large datasets well. Preferred for most use cases. Higher memory usage.
- IVFFlat — requires a training phase (needs ~10k vectors before building), lower memory, slightly worse recall. Use when memory is constrained or the dataset is stable.
Measuring actual recall quality
HNSW is approximate. How approximate? With default settings (m=16, ef_search=64), pgvector typically achieves 95–99% recall@10 — meaning 95–99% of the true top-10 nearest neighbors are returned. For agent memory retrieval, this is more than sufficient.
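Recall@k itself is just set overlap between the exact and approximate result sets — a minimal illustration with made-up result IDs:

```python
def recall_at_k(exact_ids, approx_ids, k=10):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(exact_ids) & set(approx_ids)) / k

# The true top-10 vs an HNSW result that missed one neighbor (id 7)
exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 4, 5, 6, 8, 9, 10, 42]
recall_at_k(exact, approx)  # → 0.9
```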
You can measure recall on your own dataset with a simple test: run the same query with and without the index and compare results.
```python
# Measure HNSW recall vs exact search in Python
import psycopg2

def measure_recall(conn, query_embedding, k=10):
    cur = conn.cursor()
    # pgvector expects the vector as a '[x,y,...]' string literal
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"

    # Exact search: disable index scans to force a sequential scan
    cur.execute("SET enable_indexscan = off;")
    cur.execute(
        "SELECT id FROM memories ORDER BY embedding <=> %s::vector LIMIT %s",
        (vec, k),
    )
    exact = {r[0] for r in cur.fetchall()}

    # HNSW search: re-enable index scans and set the search beam width
    cur.execute("SET enable_indexscan = on; SET hnsw.ef_search = 64;")
    cur.execute(
        "SELECT id FROM memories ORDER BY embedding <=> %s::vector LIMIT %s",
        (vec, k),
    )
    approx = {r[0] for r in cur.fetchall()}

    recall = len(exact & approx) / k
    print(f"Recall@{k}: {recall:.2%}")
    return recall
```
Performance at scale: real numbers
On a standard PostgreSQL instance (4 vCPUs, 8GB RAM) with Supabase EU (the stack Kronvex runs on), HNSW achieves:
- 100k vectors — median query latency: ~12ms, p99: ~28ms
- 1M vectors — median query latency: ~35ms, p99: ~65ms
- 10M vectors — median query latency: ~90ms, p99: ~180ms
For most B2B AI agent deployments, 1M memories covers years of operation across thousands of users. Sub-40ms median recall at that scale means memory injection adds negligible latency to your LLM calls.
Index maintenance and build time
Unlike IVFFlat, HNSW indexes are maintained incrementally — each insert updates the graph structure. This adds ~1–2ms per insert at scale, which is acceptable for most workloads.
If you're doing bulk imports (migrating from another system), it's faster to insert all vectors first and then build the index:
```sql
-- Bulk import without index pressure
DROP INDEX IF EXISTS memories_hnsw_idx;

-- ... bulk insert here ...

-- Rebuild after import
CREATE INDEX memories_hnsw_idx ON memories
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
How Kronvex uses HNSW
Kronvex stores memories with 1536-dimensional embeddings from OpenAI's text-embedding-3-small. Every /recall and /inject-context call executes a single HNSW cosine search filtered by agent ID. The confidence score reranking (similarity × 0.6 + recency × 0.2 + frequency × 0.2) runs in application code on the top-k candidates returned by the index, keeping the database query simple and fast.
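The reranking step can be sketched in application code. This is a hypothetical version of the scoring, assuming similarity, recency, and frequency are each normalized to [0, 1] (the `Candidate` type and `rerank` function are illustrative names, not Kronvex's API):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    memory_id: int
    similarity: float  # cosine similarity from the HNSW query
    recency: float     # e.g. exponential decay on memory age, in [0, 1]
    frequency: float   # e.g. normalized access count, in [0, 1]

def confidence(c: Candidate) -> float:
    """Similarity-dominated weighted blend: sim * 0.6 + recency * 0.2 + freq * 0.2."""
    return c.similarity * 0.6 + c.recency * 0.2 + c.frequency * 0.2

def rerank(candidates: list[Candidate], top_n: int = 5) -> list[Candidate]:
    # Runs on the small top-k set returned by the index, not the whole table
    return sorted(candidates, key=confidence, reverse=True)[:top_n]
```

Because the blend runs on only the top-k candidates, its cost is constant regardless of table size — the HNSW index does the heavy lifting.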
This architecture keeps median recall latency under 25ms even as memory stores grow into the hundreds of thousands — fast enough that your users will never notice the memory lookup in the overall LLM call time.