Building a Production Support Agent with Memory
By the end of this walkthrough you'll have a working customer support agent that knows your product docs, remembers every user, and gets better with every conversation. FastAPI backend, Kronvex memory, LLM of your choice.
Most AI support agents are stateless. They're fast, they know your docs, but every conversation starts from scratch. The user has to re-explain who they are, what plan they're on, what they tried last week. It feels like talking to a goldfish with a great knowledge base.
This walkthrough adds that memory: the agent remembers every user across sessions, personalizes answers based on their history, and stores new context automatically after every conversation.
Stack: FastAPI · Kronvex · OpenAI GPT-4o · your existing vector DB for RAG. The memory layer is swappable — the patterns work with any LLM.
What we're building
A FastAPI service with a single /chat endpoint. On every request it pulls relevant doc chunks from your vector DB, injects the user's Kronvex memories into the system prompt, calls the LLM, and stores the exchange as new memory in the background.
Prerequisites
- Kronvex API key — free tier at kronvex.io (100 memories, 1 agent)
- OpenAI API key — or any OpenAI-compatible LLM endpoint
- A vector DB with your docs — Supabase pgvector, Pinecone, Weaviate, etc.
- Python 3.10+ and pip
Step-by-step
```shell
pip install fastapi uvicorn openai kronvex httpx python-dotenv
```
Create a .env file — never commit this:

```
KRONVEX_API_KEY=kx_live_your_key_here
KRONVEX_AGENT_ID=agent_support_001
OPENAI_API_KEY=sk-your_openai_key

# Optional — your RAG endpoint
RAG_ENDPOINT=https://your-rag-api.example.com/search
```
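It can help to fail fast at startup when one of the required variables above is missing, rather than getting an opaque auth error on the first request. A minimal sketch (the `check_env` helper and `REQUIRED` list are illustrative, not part of any SDK):

```python
import os

# Required settings for the agent to boot (names match the .env above)
REQUIRED = ["KRONVEX_API_KEY", "KRONVEX_AGENT_ID", "OPENAI_API_KEY"]


def check_env(env: dict) -> list[str]:
    """Return the names of required settings that are missing or empty."""
    return [name for name in REQUIRED if not env.get(name)]


# At startup, e.g. in main.py:
#   missing = check_env(dict(os.environ))
#   if missing:
#       raise SystemExit(f"Missing env vars: {', '.join(missing)}")
```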
context.py — context retrieval and memory storage:

```python
import os

import httpx
from kronvex import KronvexClient

kx = KronvexClient(api_key=os.getenv("KRONVEX_API_KEY"))
AGENT_ID = os.getenv("KRONVEX_AGENT_ID")
RAG_ENDPOINT = os.getenv("RAG_ENDPOINT")


async def get_rag_chunks(query: str, top_k: int = 3) -> str:
    """Retrieve relevant doc chunks from your vector DB."""
    if not RAG_ENDPOINT:
        return ""
    async with httpx.AsyncClient() as client:
        r = await client.post(
            RAG_ENDPOINT,
            json={"query": query, "top_k": top_k},
            timeout=4.0,
        )
        chunks = r.json().get("results", [])
    return "\n\n".join(c.get("text", "") for c in chunks)


async def build_context(user_id: str, message: str) -> str:
    """Build the full context block: RAG docs + user memories."""
    sections = []

    # 1. RAG — shared product knowledge
    try:
        docs = await get_rag_chunks(message)
        if docs:
            sections.append(f"[PRODUCT DOCUMENTATION]\n{docs}")
    except Exception:
        pass  # RAG failure is non-fatal

    # 2. Kronvex — user-specific memory
    try:
        ctx = kx.inject_context(
            message=message,
            agent_id=user_id,
            threshold=0.65,
            top_k=5,
        )
        if ctx.memories_used > 0:
            sections.append(ctx.context_block)
    except Exception:
        pass  # memory failure is non-fatal

    return "\n\n".join(sections)


async def store_interaction(
    user_id: str,
    user_message: str,
    agent_response: str,
    session_id: str,
):
    """Store the interaction as episodic memory after responding."""
    # Only store substantive exchanges (skip "thanks", "ok", etc.)
    if len(user_message) < 20:
        return
    summary = (
        f"User said: {user_message[:200]}\n"
        f"Agent replied: {agent_response[:300]}"
    )
    kx.remember(
        content=summary,
        agent_id=user_id,
        memory_type="episodic",
        session_id=session_id,
        ttl_days=90,
    )
```
main.py — the /chat route. It handles context injection, LLM generation, and memory storage — all in one request cycle.

```python
import asyncio
import os

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import AsyncOpenAI

from context import build_context, store_interaction

app = FastAPI()
oai = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

SYSTEM_BASE = """You are a helpful customer support agent for Kronvex.
You have access to product documentation and the user's conversation history.
Be concise, direct, and personal. If you have past context about this user,
use it. Do NOT repeat information the user already knows."""


class ChatRequest(BaseModel):
    user_id: str
    message: str
    session_id: str = "default"
    conversation: list[dict] = []  # last N turns


class ChatResponse(BaseModel):
    reply: str
    memories_used: int = 0
    rag_used: bool = False


@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
    # 1. Build context (RAG + memory) — non-blocking
    context_block = await build_context(req.user_id, req.message)

    # 2. Compose system prompt
    system = SYSTEM_BASE
    if context_block:
        system += f"\n\n{context_block}"

    # 3. Build messages array
    messages = [{"role": "system", "content": system}]
    messages.extend(req.conversation[-6:])  # last 3 turns
    messages.append({"role": "user", "content": req.message})

    # 4. Call LLM
    completion = await oai.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=600,
        temperature=0.4,
    )
    reply = completion.choices[0].message.content

    # 5. Store interaction async — don't block the response
    asyncio.create_task(
        store_interaction(req.user_id, req.message, reply, req.session_id)
    )

    return ChatResponse(
        reply=reply,
        memories_used=1 if context_block else 0,
        rag_used="PRODUCT DOCUMENTATION" in context_block,
    )


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```
When a user signs up, seed their profile as pinned semantic memories:

```python
def store_user_profile(user_id: str, user_data: dict):
    """Call this on signup / first login."""
    facts = [
        (f"User plan: {user_data['plan']}", "semantic"),
        (f"User name: {user_data['name']}", "semantic"),
        (f"Joined: {user_data['created_at']}", "semantic"),
        (f"Company: {user_data.get('company', '—')}", "semantic"),
    ]
    for content, mtype in facts:
        kx.remember(
            content=content,
            agent_id=user_id,
            memory_type=mtype,
            pinned=True,  # pinned = never expires, always recalled
        )
```
Use `pinned=True` for critical facts — plan, name, company. These bypass TTL and are always returned first in any recall. Reserve pinned for facts that are always relevant, not just frequently mentioned.

Run the server:

```shell
uvicorn main:app --reload
```
```shell
# First message — agent has no memory yet
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user_alice",
    "session_id": "sess_001",
    "message": "I keep hitting rate limits on the /recall endpoint"
  }'

# New session — agent remembers from last time
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user_alice",
    "session_id": "sess_002",
    "message": "Still having that issue"
  }'
# → Agent responds: "Still hitting the /recall rate limits?
#   Alice, you're on the Starter plan (1k req/day limit)..."
```
Production checklist
Before you ship this to real users, a few things to add:
- Auth middleware — verify that `user_id` in the request matches the authenticated user. Never let clients set their own `user_id` freely.
- Rate limit the /chat endpoint — one user shouldn't be able to burn your entire Kronvex quota. 20 req/min per user is a sensible default.
- Memory deduplication — if a user says "I'm on the Pro plan" in 10 conversations, you don't want 10 identical semantic memories. Check before storing, or use `pinned=True` for facts that shouldn't duplicate.
- TTL strategy — episodic memories (conversations) expire in 90 days by default. Semantic facts (plan, name) should be pinned or given long TTLs. Procedural rules (agent behavior for this user) should be pinned.
- Context budget — log `memories_used` and `rag_used` per request. If you're consistently hitting context limits, reduce `top_k` on either RAG or Kronvex.
- Error handling — both `build_context()` and `store_interaction()` are wrapped in try/except. A memory failure should never break a chat response.
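For the deduplication item, one lightweight client-side approach is to fingerprint each fact before calling `remember()` and skip repeats. A sketch (the `MemoryDeduper` helper is illustrative, not part of the Kronvex SDK; it only catches repeats that differ in case or whitespace, not genuine rephrasings):

```python
import hashlib


def memory_fingerprint(content: str) -> str:
    """Normalize a fact (case, whitespace) and hash it."""
    normalized = " ".join(content.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()


class MemoryDeduper:
    """Tracks fingerprints of stored facts; skip a store when already seen."""

    def __init__(self):
        self._seen: set[str] = set()

    def should_store(self, content: str) -> bool:
        fp = memory_fingerprint(content)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True
```

In practice you'd persist the fingerprints per user (e.g. in Redis or your DB) and guard the `kx.remember()` call in `store_interaction()` with `should_store()`.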
What the memory looks like after a few conversations
Here's what Kronvex stores for a user after 3 support sessions:
```json
[
  {
    "content": "User plan: Pro",
    "memory_type": "semantic",
    "pinned": true,
    "access_count": 14
  },
  {
    "content": "Session sess_001: User hit rate limits on /recall. On Starter plan at the time.",
    "memory_type": "episodic",
    "session_id": "sess_001",
    "ttl_days": 90,
    "access_count": 3
  },
  {
    "content": "Session sess_002: User upgraded to Pro. Rate limit issue resolved.",
    "memory_type": "episodic",
    "session_id": "sess_002",
    "ttl_days": 90,
    "access_count": 1
  },
  {
    "content": "User prefers bullet-point answers over long paragraphs.",
    "memory_type": "procedural",
    "pinned": true,
    "access_count": 8
  }
]
```
When Alice opens a new conversation, inject_context() retrieves the most relevant of these (by semantic similarity to her first message) and injects them as a system prompt block — automatically, before the LLM ever sees her message.
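To build intuition for what that injected block looks like, here's a local sketch that ranks memories the way described above (pinned first, then by access count) and renders them as a prompt section. This is an illustration of the ordering, not the actual `inject_context()` implementation:

```python
def format_context_block(memories: list[dict], max_items: int = 5) -> str:
    """Rank memories pinned-first, then by access_count, and render a block."""
    ranked = sorted(
        memories,
        key=lambda m: (not m.get("pinned", False), -m.get("access_count", 0)),
    )
    lines = [f"- ({m['memory_type']}) {m['content']}" for m in ranked[:max_items]]
    return "[USER MEMORY]\n" + "\n".join(lines)
```

Fed the four memories above, this puts "User plan: Pro" and the bullet-point preference ahead of the episodic session summaries.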
Next steps
- Add a session explorer — surface a "conversation history" panel to users in your UI, powered by `GET /agents/{id}/sessions`
- Memory-driven personalization — if a user's procedural memory says "prefers Python", always default code examples to Python without asking
- Escalation memory — if the agent can't resolve an issue, store it as pinned semantic memory so a human agent picks it up with full context
- Webhooks — coming to Kronvex soon: fire a webhook on every `remember()` call to sync with your CRM in real time
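The memory-driven personalization idea can be as simple as scanning procedural memories for a stated preference before formatting a reply. A hypothetical helper (not an SDK feature), using the answer-format preference from the example memories earlier:

```python
def preferred_format(memories: list[dict], default: str = "paragraphs") -> str:
    """Scan procedural memories for a stated answer-format preference."""
    for m in memories:
        if m.get("memory_type") != "procedural":
            continue
        content = m.get("content", "").lower()
        if "bullet" in content:
            return "bullets"
        if "paragraph" in content:
            return "paragraphs"
    return default
```

The returned value can then steer the system prompt ("Answer in bullet points") without ever asking the user again.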
Ready to build?
Get your free API key — 100 memories, 1 agent, no credit card. Under 5 minutes to first memory stored.
Get free API key →