Semantic Search for AI Agents:
How pgvector powers intelligent recall
Keyword search is broken for agent memory. Searching "car" shouldn't miss "vehicle", and "user prefers brevity" shouldn't fail to match "keep responses short". Semantic search fixes this by comparing meaning rather than text. Here is how pgvector cosine similarity, HNSW indexing, and multi-factor confidence scoring work together to make Kronvex's recall genuinely intelligent.
Keyword vs semantic search
Traditional keyword search (BM25, LIKE queries, full-text search) is exact: it looks for the words you typed. This works fine for document retrieval where you know the terminology. It breaks badly for agent memory, where:
- A memory stored as "user drives a Honda Civic" won't match a query for "what car does this person have"
- "Client is price-sensitive" won't match "what are this customer's constraints"
- "Meeting scheduled for Tuesday" won't match "what's coming up this week"
Semantic search operates differently: it converts both the stored memory and the query into high-dimensional vectors (embeddings) that encode meaning. Semantically similar texts end up close together in vector space, regardless of exact word choice.
How embeddings work
An embedding model (Kronvex uses OpenAI's text-embedding-3-small) converts a piece of text into a dense vector of floating-point numbers. For text-embedding-3-small, each text becomes a vector of 1536 dimensions.
Each dimension captures some latent semantic feature of the text — though the features are not human-interpretable. What matters is that the model was trained on vast amounts of text such that semantically related passages produce vectors that are geometrically close.
```python
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding  # list of 1536 floats

vec_memory = embed("User drives a Honda Civic")
vec_query = embed("what car does this person drive")
# vec_memory and vec_query will be close in cosine distance
# despite sharing zero words
```
Kronvex calls this embedding step automatically when you call remember — you pass raw text, we store the vector. When you call recall, your query text is embedded on-the-fly and compared against stored vectors.
Cosine similarity explained
Two vectors can be compared using different distance metrics. Kronvex, like most embedding systems, uses cosine similarity because it is invariant to vector magnitude — only direction matters, not length. This means a short memory and a long one with similar meaning will score comparably.
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
A · B = dot product of the two vectors
||A||, ||B|| = Euclidean norms (magnitudes)
Result range: −1 (opposite) to +1 (identical)
Typical similarity threshold for useful memories: > 0.75
In practice, cosine similarity scores for agent memory retrieval tend to cluster: identical or near-paraphrased content scores 0.92–0.99, semantically related but differently worded content scores 0.75–0.92, loosely related content scores 0.60–0.75, and unrelated content scores below 0.60.
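As a quick illustration, those bands can be expressed as a small helper that labels a raw similarity score. The band boundaries are the rough heuristics from this section, not an API guarantee:

```python
def similarity_band(score: float) -> str:
    """Label a cosine similarity score using the rough bands above."""
    if score >= 0.92:
        return "near-duplicate"
    if score >= 0.75:
        return "semantically related"
    if score >= 0.60:
        return "loosely related"
    return "unrelated"

print(similarity_band(0.95))  # near-duplicate
print(similarity_band(0.81))  # semantically related
print(similarity_band(0.42))  # unrelated
```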
```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x**2 for x in a))
    norm_b = math.sqrt(sum(x**2 for x in b))
    return dot / (norm_a * norm_b)

# pgvector operator equivalents:
#   <=>  cosine distance (1 - cosine_similarity)
#   <->  Euclidean (L2) distance
#   <#>  negative inner product
```
Note: pgvector uses the <=> operator for cosine distance (1 minus cosine similarity). So lower <=> values mean higher similarity. This is why Kronvex's SQL sorts ORDER BY embedding <=> query_embedding ASC.
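To see why ascending distance order is equivalent to descending similarity order, here is a minimal pure-Python check that mirrors the `<=>` semantics (cosine distance = 1 − cosine similarity):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cosine_distance(a, b):
    # Mirrors pgvector's <=> operator: 1 - cosine similarity
    return 1.0 - cosine_similarity(a, b)

query = [1.0, 0.0]
candidates = {"close": [0.9, 0.1], "far": [0.1, 0.9]}

# Sorting ascending by distance puts the most similar vector first
ranked = sorted(candidates, key=lambda k: cosine_distance(query, candidates[k]))
print(ranked)  # ['close', 'far']
```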
HNSW: approximate nearest neighbour at scale
Exact nearest-neighbour search over 1536-dimensional vectors requires comparing the query against every stored vector — O(n) per query. For an agent with 100,000 memories, that's 100k dot products. At 1536 floats each, this is computationally expensive and grows linearly with memory count.
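For reference, the exact O(n) baseline is just a full scan plus a heap. A minimal sketch, using a plain-Python cosine similarity:

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def brute_force_top_k(query, vectors, k=5):
    """Exact nearest neighbours: score every stored vector, keep the best k."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)  # O(n) comparisons per query

vectors = [[1, 0], [0.9, 0.1], [0, 1], [-1, 0]]
top = brute_force_top_k([1.0, 0.0], vectors, k=2)
print([i for _, i in top])  # [0, 1]
```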
HNSW (Hierarchical Navigable Small World) is an approximate nearest-neighbour algorithm that trades a small amount of accuracy for massive speed gains. It builds a layered graph structure where each node is connected to its nearest neighbours at multiple granularity levels. Search traverses this graph instead of scanning the full table.
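The core idea — following graph edges toward the query instead of scanning everything — can be sketched as a greedy search over a single toy layer. Real HNSW adds multiple layers and an ef-sized candidate beam; this is only the intuition:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def greedy_search(graph, vectors, query, entry):
    """Walk the neighbour graph, always moving to the closest neighbour,
    stopping when no neighbour improves on the current node."""
    current = entry
    while True:
        best = min(graph[current],
                   key=lambda n: cosine_distance(query, vectors[n]),
                   default=current)
        if cosine_distance(query, vectors[best]) >= cosine_distance(query, vectors[current]):
            return current
        current = best

# Toy layer: 4 vectors with nearest-neighbour edges
vectors = {0: [1, 0], 1: [0.7, 0.7], 2: [0, 1], 3: [-1, 0]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search(graph, vectors, [0.0, 1.0], entry=0))  # 2
```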
```sql
-- Create HNSW index on the embeddings column
CREATE INDEX memories_embedding_hnsw
ON memories
USING hnsw (embedding vector_cosine_ops)
WITH (
    m = 16,               -- connections per node (higher = better recall, more memory)
    ef_construction = 64  -- build-time search depth (higher = better index quality)
);

-- Set search-time accuracy at query time
SET hnsw.ef_search = 40;  -- higher = more accurate but slower
```
Practical speed comparison on 100k memories with 1536-dim vectors: exact brute-force scan ~180ms per query; HNSW with ef_search=40 ~3ms per query. The approximate results miss 1–5% of the true nearest neighbours — an acceptable trade-off when the top-5 results are all you need.
Kronvex confidence scoring formula
Raw cosine similarity is not enough for memory retrieval. A memory stored yesterday with 0.82 similarity should rank above one stored two years ago with 0.88 similarity — especially for conversational context. Kronvex applies a multi-factor confidence score:
confidence = 0.6 × similarity + 0.2 × recency + 0.2 × frequency
similarity = cosine similarity between query and memory embeddings
recency = sigmoid decay, inflection at 30 days: 1 / (1 + e^((age_days − 30) / 10))
frequency = log-scaled access count: log(1 + access_count) / log(1 + max_count)
confidence range: 0–1 (higher = return first)
Each factor plays a distinct role:
- Similarity (60%): The dominant signal. Ensures semantically irrelevant memories never surface regardless of age or frequency.
- Recency (20%): Uses a sigmoid function so memories from the last 30 days receive a meaningful boost, while very old memories decay gradually rather than abruptly. A 1-day-old memory scores ~0.95; a 60-day-old memory scores ~0.05.
- Frequency (20%): Memories that have been recalled many times are more likely to be important. Log-scaling prevents a memory recalled 1000 times from completely dominating. With max_count = 100, a memory accessed 10 times scores ~0.52 of maximum; one that has never been accessed scores 0.
```python
import math
from datetime import datetime, timezone

def confidence_score(
    similarity: float,
    created_at: datetime,
    access_count: int,
    max_access_count: int = 100,
) -> float:
    # Recency: sigmoid decay with 30-day inflection
    age_days = (datetime.now(timezone.utc) - created_at).days
    recency = 1.0 / (1.0 + math.exp((age_days - 30) / 10))

    # Frequency: log-scaled access count
    frequency = (
        math.log(1 + access_count) / math.log(1 + max_access_count)
        if max_access_count > 0
        else 0.0
    )

    return similarity * 0.6 + recency * 0.2 + frequency * 0.2
```
Code: raw pgvector query vs Kronvex API
You can implement this yourself in raw SQL, or use the Kronvex API which handles embedding, indexing, and confidence scoring for you. Here is the comparison:
```python
import asyncpg
from openai import OpenAI

# Assumes DATABASE_URL and the confidence_score() helper from the
# previous section are already defined in scope.

async def recall_raw(query: str, agent_id: str, top_k: int = 5):
    # 1. Embed the query (API call + latency)
    oai = OpenAI()
    q_vec = oai.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding

    # 2. Query pgvector (you manage connection pool, schema, index)
    conn = await asyncpg.connect(DATABASE_URL)
    rows = await conn.fetch("""
        SELECT content,
               1 - (embedding <=> $1::vector) AS similarity,
               created_at,
               access_count
        FROM memories
        WHERE agent_id = $2
        ORDER BY embedding <=> $1::vector
        LIMIT $3
    """, q_vec, agent_id, top_k)

    # 3. Apply confidence scoring manually
    results = []
    max_access = max((r['access_count'] for r in rows), default=1)
    for row in rows:
        score = confidence_score(
            row['similarity'], row['created_at'],
            row['access_count'], max_access,
        )
        results.append({'content': row['content'], 'score': score})
    return sorted(results, key=lambda x: -x['score'])

# Total: ~150–200ms, ~60 lines of boilerplate, you manage DB schema
```
```python
from kronvex import Kronvex

kv = Kronvex("kv-your-api-key")
agent = kv.agent("your-agent-id")

result = agent.recall(query="what car does this user drive", top_k=5)
for memory in result.memories:
    print(f"{memory.confidence:.3f} — {memory.content}")

# 0.847 — User drives a Honda Civic
# 0.761 — User mentioned needing a reliable commuter car
# 0.634 — User asked about car insurance options
# Total: ~50ms end-to-end, 3 lines, HNSW + confidence scoring included
```
Tuning recall threshold for your use case
The confidence threshold determines which memories are included in the context injected into your LLM. Setting it too low adds noise; too high misses relevant memories.
Recommended starting thresholds
- Customer support bots: 0.72 — include anything reasonably related to the current issue. False positives are less harmful than false negatives (missing a past resolution).
- Sales agents: 0.78 — be selective. Injecting wrong context to a prospect is worse than injecting none.
- Personal assistants: 0.65 — cast a wider net. Users want their assistant to make connections they didn't explicitly ask for.
- Code assistants: 0.80 — precise recall only. Injecting a wrong architecture assumption into a code context causes bugs.
```python
result = agent.recall(
    query=user_message,
    top_k=8,
    min_confidence=0.72,  # filter out low-confidence matches
    session_id=user_id,
)
```
Then tune empirically: log result.memories with their confidence scores for the first few hundred queries. Look at the distribution: if you're getting many results between 0.60 and 0.70, raise the threshold. If you're frequently getting zero results on queries you expect to match, lower it, or re-examine how you're storing memories (shorter, more focused memory strings work better than long paragraphs).
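That tuning loop can be sketched as a small heuristic over logged scores. This is illustrative only: the band boundaries and the 20% cutoff are assumptions, not a Kronvex recommendation:

```python
def suggest_threshold(scores: list[float], current: float) -> str:
    """Rough heuristic over logged confidence scores: many results just
    below ~0.70 suggest noise; almost nothing clearing the threshold
    suggests it is too strict."""
    if not scores:
        return "no data yet; keep logging"
    in_gray_zone = sum(0.60 <= s < 0.70 for s in scores) / len(scores)
    above = sum(s >= current for s in scores) / len(scores)
    if in_gray_zone > 0.2:
        return "many borderline matches; consider raising the threshold"
    if above < 0.05:
        return "almost nothing clears the threshold; consider lowering it"
    return "distribution looks healthy"

print(suggest_threshold([0.81, 0.77, 0.64, 0.66, 0.69, 0.63], current=0.72))
# many borderline matches; consider raising the threshold
```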