Semantic Search for AI Agents:
How pgvector powers intelligent recall
Keyword search is broken for agent memory. Searching "car" shouldn't miss "vehicle", and "user prefers brevity" shouldn't fail to match "keep responses short". Semantic search fixes this by comparing meaning rather than text. Here is how pgvector cosine similarity, HNSW indexing, and multi-factor confidence scoring work together to make Kronvex's recall genuinely intelligent.
Keyword vs semantic search
Traditional keyword search (BM25, LIKE queries, full-text search) is exact: it looks for the words you typed. This works fine for document retrieval where you know the terminology. It breaks badly for agent memory, where:
- A memory stored as "user drives a Honda Civic" won't match a query for "what car does this person have"
- "Client is price-sensitive" won't match "what are this customer's constraints"
- "Meeting scheduled for Tuesday" won't match "what's coming up this week"
Semantic search operates differently: it converts both the stored memory and the query into high-dimensional vectors (embeddings) that encode meaning. Semantically similar texts end up close together in vector space, regardless of exact word choice.
How embeddings work
An embedding model (Kronvex uses OpenAI's text-embedding-3-small) converts a piece of text into a dense vector of floating-point numbers. For text-embedding-3-small, each text becomes a vector of 1536 dimensions.
Each dimension captures some latent semantic feature of the text — though the features are not human-interpretable. What matters is that the model was trained on vast amounts of text such that semantically related passages produce vectors that are geometrically close.
```python
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding  # list of 1536 floats

vec_memory = embed("User drives a Honda Civic")
vec_query = embed("what car does this person drive")
# vec_memory and vec_query will be close in cosine distance
# despite sharing zero words
```
Kronvex calls this embedding step automatically when you call remember — you pass raw text, we store the vector. When you call recall, your query text is embedded on-the-fly and compared against stored vectors.
Cosine similarity explained
Two vectors can be compared using different distance metrics. Kronvex, like most embedding systems, uses cosine similarity because it is invariant to vector magnitude — only direction matters, not length. This means a short memory and a long one with similar meaning will score comparably.
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
A · B = dot product of the two vectors
||A||, ||B|| = Euclidean norms (magnitudes)
Result range: −1 (opposite) to +1 (identical)
Typical similarity threshold for useful memories: > 0.75
In practice, cosine similarity scores for agent memory retrieval tend to cluster: identical or near-paraphrased content scores 0.92–0.99, semantically related but differently worded content scores 0.75–0.92, loosely related content scores 0.60–0.75, and unrelated content scores below 0.60.
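As a quick illustration, those bands can be expressed as a small helper that labels a raw similarity score. The band boundaries are the rough heuristics from this section, not an API guarantee:

```python
def similarity_band(score: float) -> str:
    """Label a cosine similarity score using the rough bands above."""
    if score >= 0.92:
        return "near-duplicate"
    if score >= 0.75:
        return "semantically related"
    if score >= 0.60:
        return "loosely related"
    return "unrelated"

print(similarity_band(0.95))  # near-duplicate
print(similarity_band(0.81))  # semantically related
print(similarity_band(0.42))  # unrelated
```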
```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x**2 for x in a))
    norm_b = math.sqrt(sum(x**2 for x in b))
    return dot / (norm_a * norm_b)

# pgvector operator equivalents:
#   <=>  cosine distance (1 - cosine_similarity)
#   <->  Euclidean (L2) distance
#   <#>  negative inner product
```
Note: pgvector uses the <=> operator for cosine distance (1 minus cosine similarity). So lower <=> values mean higher similarity. This is why Kronvex's SQL sorts ORDER BY embedding <=> query_embedding ASC.
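To see why ascending distance order is equivalent to descending similarity order, here is a minimal pure-Python check that mirrors the `<=>` semantics (cosine distance = 1 − cosine similarity):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cosine_distance(a, b):
    # Mirrors pgvector's <=> operator: 1 - cosine similarity
    return 1.0 - cosine_similarity(a, b)

query = [1.0, 0.0]
candidates = {"close": [0.9, 0.1], "far": [0.1, 0.9]}

# Sorting ascending by distance puts the most similar vector first
ranked = sorted(candidates, key=lambda k: cosine_distance(query, candidates[k]))
print(ranked)  # ['close', 'far']
```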
HNSW: approximate nearest neighbour at scale
Exact nearest-neighbour search over 1536-dimensional vectors requires comparing the query against every stored vector — O(n) per query. For an agent with 100,000 memories, that's 100k dot products. At 1536 floats each, this is computationally expensive and grows linearly with memory count.
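For reference, the exact O(n) baseline is just a full scan plus a heap. A minimal sketch, using a plain-Python cosine similarity:

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def brute_force_top_k(query, vectors, k=5):
    """Exact nearest neighbours: score every stored vector, keep the best k."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)  # O(n) comparisons per query

vectors = [[1, 0], [0.9, 0.1], [0, 1], [-1, 0]]
top = brute_force_top_k([1.0, 0.0], vectors, k=2)
print([i for _, i in top])  # [0, 1]
```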
HNSW (Hierarchical Navigable Small World) is an approximate nearest-neighbour algorithm that trades a small amount of accuracy for massive speed gains. It builds a layered graph structure where each node is connected to its nearest neighbours at multiple granularity levels. Search traverses this graph instead of scanning the full table.
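The core idea — following graph edges toward the query instead of scanning everything — can be sketched as a greedy search over a single toy layer. Real HNSW adds multiple layers and an ef-sized candidate beam; this is only the intuition:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def greedy_search(graph, vectors, query, entry):
    """Walk the neighbour graph, always moving to the closest neighbour,
    stopping when no neighbour improves on the current node."""
    current = entry
    while True:
        best = min(graph[current],
                   key=lambda n: cosine_distance(query, vectors[n]),
                   default=current)
        if cosine_distance(query, vectors[best]) >= cosine_distance(query, vectors[current]):
            return current
        current = best

# Toy layer: 4 vectors with nearest-neighbour edges
vectors = {0: [1, 0], 1: [0.7, 0.7], 2: [0, 1], 3: [-1, 0]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search(graph, vectors, [0.0, 1.0], entry=0))  # 2
```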
```sql
-- Create HNSW index on the embeddings column
CREATE INDEX memories_embedding_hnsw
ON memories
USING hnsw (embedding vector_cosine_ops)
WITH (
    m = 16,               -- connections per node (higher = better recall, more memory)
    ef_construction = 64  -- build-time search depth (higher = better index quality)
);

-- Set search-time accuracy at query time
SET hnsw.ef_search = 40;  -- higher = more accurate but slower
```
Practical speed comparison on 100k memories with 1536-dim vectors: exact brute-force scan ~180ms per query; HNSW with ef_search=40 ~3ms per query. The approximate results miss 1–5% of the true nearest neighbours — an acceptable trade-off when the top-5 results are all you need.
Kronvex confidence scoring formula
Raw cosine similarity is not enough for memory retrieval. A memory stored yesterday with 0.82 similarity should rank above one stored two years ago with 0.88 similarity — especially for conversational context. Kronvex applies a multi-factor confidence score:
confidence = 0.6 × similarity + 0.2 × recency + 0.2 × frequency
similarity = cosine similarity between query and memory embeddings
recency = sigmoid decay, inflection at 30 days: 1 / (1 + e^((age_days − 30) / 10))
frequency = log-scaled access count: log(1 + access_count) / log(1 + max_count)
confidence range: 0–1 (higher = return first)
Each factor plays a distinct role:
- Similarity (60%): The dominant signal. Ensures semantically irrelevant memories never surface regardless of age or frequency.
- Recency (20%): Uses a sigmoid function so memories from the last 30 days receive a meaningful boost, while very old memories decay gradually rather than abruptly. A 1-day-old memory scores ~0.95; a 60-day-old memory scores ~0.05.
- Frequency (20%): Memories that have been recalled many times are more likely to be important. Log-scaling prevents a memory recalled 1000 times from completely dominating. With max_count = 100, a memory accessed 10 times scores ~0.52 of maximum; one that has never been accessed scores 0.
```python
import math
from datetime import datetime, timezone

def confidence_score(
    similarity: float,
    created_at: datetime,
    access_count: int,
    max_access_count: int = 100,
) -> float:
    # Recency: sigmoid decay with 30-day inflection
    age_days = (datetime.now(timezone.utc) - created_at).days
    recency = 1.0 / (1.0 + math.exp((age_days - 30) / 10))

    # Frequency: log-scaled access count
    frequency = (
        math.log(1 + access_count) / math.log(1 + max_access_count)
        if max_access_count > 0
        else 0.0
    )

    return similarity * 0.6 + recency * 0.2 + frequency * 0.2
```
Code: raw pgvector query vs Kronvex API
You can implement this yourself in raw SQL, or use the Kronvex API which handles embedding, indexing, and confidence scoring for you. Here is the comparison:
```python
import asyncpg
from openai import OpenAI

# Assumes DATABASE_URL and the confidence_score() helper from the
# previous section are already defined in scope.

async def recall_raw(query: str, agent_id: str, top_k: int = 5):
    # 1. Embed the query (API call + latency)
    oai = OpenAI()
    q_vec = oai.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding

    # 2. Query pgvector (you manage connection pool, schema, index)
    conn = await asyncpg.connect(DATABASE_URL)
    rows = await conn.fetch("""
        SELECT content,
               1 - (embedding <=> $1::vector) AS similarity,
               created_at,
               access_count
        FROM memories
        WHERE agent_id = $2
        ORDER BY embedding <=> $1::vector
        LIMIT $3
    """, q_vec, agent_id, top_k)

    # 3. Apply confidence scoring manually
    results = []
    max_access = max((r['access_count'] for r in rows), default=1)
    for row in rows:
        score = confidence_score(
            row['similarity'], row['created_at'],
            row['access_count'], max_access,
        )
        results.append({'content': row['content'], 'score': score})
    return sorted(results, key=lambda x: -x['score'])

# Total: ~150–200ms, ~60 lines of boilerplate, you manage DB schema
```
```python
from kronvex import Kronvex

kv = Kronvex("kv-your-api-key")
agent = kv.agent("your-agent-id")

result = agent.recall(query="what car does this user drive", top_k=5)
for memory in result.memories:
    print(f"{memory.confidence:.3f} — {memory.content}")

# 0.847 — User drives a Honda Civic
# 0.761 — User mentioned needing a reliable commuter car
# 0.634 — User asked about car insurance options
# Total: ~50ms end-to-end, 3 lines, HNSW + confidence scoring included
```
Tuning recall threshold for your use case
The confidence threshold determines which memories are included in the context injected into your LLM. Setting it too low adds noise; too high misses relevant memories.
Recommended starting thresholds
- Customer support bots: 0.72 — include anything reasonably related to the current issue. False positives are less harmful than false negatives (missing a past resolution).
- Sales agents: 0.78 — be selective. Injecting wrong context to a prospect is worse than injecting none.
- Personal assistants: 0.65 — cast a wider net. Users want their assistant to make connections they didn't explicitly ask for.
- Code assistants: 0.80 — precise recall only. Injecting a wrong architecture assumption into a code context causes bugs.
```python
result = agent.recall(
    query=user_message,
    top_k=8,
    min_confidence=0.72,  # filter out low-confidence matches
    session_id=user_id,
)
```
Then tune empirically: log result.memories with their confidence scores for the first few hundred queries. Look at the distribution: if you're getting many results between 0.60 and 0.70, raise the threshold. If you're frequently getting zero results on queries you expect to match, lower it, or re-examine how you're storing memories (shorter, more focused memory strings work better than long paragraphs).
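That tuning loop can be sketched as a small heuristic over logged scores. This is illustrative only: the band boundaries and the 20% cutoff are assumptions, not a Kronvex recommendation:

```python
def suggest_threshold(scores: list[float], current: float) -> str:
    """Rough heuristic over logged confidence scores: many results just
    below ~0.70 suggest noise; almost nothing clearing the threshold
    suggests it is too strict."""
    if not scores:
        return "no data yet; keep logging"
    in_gray_zone = sum(0.60 <= s < 0.70 for s in scores) / len(scores)
    above = sum(s >= current for s in scores) / len(scores)
    if in_gray_zone > 0.2:
        return "many borderline matches; consider raising the threshold"
    if above < 0.05:
        return "almost nothing clears the threshold; consider lowering it"
    return "distribution looks healthy"

print(suggest_threshold([0.81, 0.77, 0.64, 0.66, 0.69, 0.63], current=0.72))
# many borderline matches; consider raising the threshold
```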