Building a Production Support Agent with Memory
By the end of this walkthrough you'll have a working customer support agent that knows your product docs, remembers every user, and gets better with every conversation. FastAPI backend, Kronvex memory, LLM of your choice.
Most AI support agents are stateless. They're fast, they know your docs, but every conversation starts from scratch. The user has to re-explain who they are, what plan they're on, what they tried last week. It feels like talking to a goldfish with a great knowledge base.
This walkthrough adds that memory: the agent remembers every user across sessions, personalizes answers based on their history, and stores new context automatically after every conversation.
Stack: FastAPI · Kronvex · OpenAI GPT-4o · your existing vector DB for RAG. The memory layer is swappable — the patterns work with any LLM.
What we're building
A FastAPI service with a single /chat endpoint. On every request it pulls relevant doc chunks from your vector DB, injects the user's Kronvex memories into the system prompt, calls the LLM, and stores the exchange as new memory in the background.
Prerequisites
- Kronvex API key — free tier at kronvex.io (100 memories, 1 agent)
- OpenAI API key — or any OpenAI-compatible LLM endpoint
- A vector DB with your docs — Supabase pgvector, Pinecone, Weaviate, etc.
- Python 3.10+ and pip
Step-by-step
```shell
pip install fastapi uvicorn openai kronvex httpx python-dotenv
```
Create a .env file — never commit this:

```
KRONVEX_API_KEY=kx_live_your_key_here
KRONVEX_AGENT_ID=agent_support_001
OPENAI_API_KEY=sk-your_openai_key

# Optional — your RAG endpoint
RAG_ENDPOINT=https://your-rag-api.example.com/search
```
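It can help to fail fast at startup when one of the required variables above is missing, rather than getting an opaque auth error on the first request. A minimal sketch (the `check_env` helper and `REQUIRED` list are illustrative, not part of any SDK):

```python
import os

# Required settings for the agent to boot (names match the .env above)
REQUIRED = ["KRONVEX_API_KEY", "KRONVEX_AGENT_ID", "OPENAI_API_KEY"]


def check_env(env: dict) -> list[str]:
    """Return the names of required settings that are missing or empty."""
    return [name for name in REQUIRED if not env.get(name)]


# At startup, e.g. in main.py:
#   missing = check_env(dict(os.environ))
#   if missing:
#       raise SystemExit(f"Missing env vars: {', '.join(missing)}")
```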
context.py — context retrieval and memory storage:

```python
import os

import httpx
from kronvex import KronvexClient

kx = KronvexClient(api_key=os.getenv("KRONVEX_API_KEY"))
AGENT_ID = os.getenv("KRONVEX_AGENT_ID")
RAG_ENDPOINT = os.getenv("RAG_ENDPOINT")


async def get_rag_chunks(query: str, top_k: int = 3) -> str:
    """Retrieve relevant doc chunks from your vector DB."""
    if not RAG_ENDPOINT:
        return ""
    async with httpx.AsyncClient() as client:
        r = await client.post(
            RAG_ENDPOINT,
            json={"query": query, "top_k": top_k},
            timeout=4.0,
        )
        chunks = r.json().get("results", [])
    return "\n\n".join(c.get("text", "") for c in chunks)


async def build_context(user_id: str, message: str) -> str:
    """Build the full context block: RAG docs + user memories."""
    sections = []

    # 1. RAG — shared product knowledge
    try:
        docs = await get_rag_chunks(message)
        if docs:
            sections.append(f"[PRODUCT DOCUMENTATION]\n{docs}")
    except Exception:
        pass  # RAG failure is non-fatal

    # 2. Kronvex — user-specific memory
    try:
        ctx = kx.inject_context(
            message=message,
            agent_id=user_id,
            threshold=0.65,
            top_k=5,
        )
        if ctx.memories_used > 0:
            sections.append(ctx.context_block)
    except Exception:
        pass  # memory failure is non-fatal

    return "\n\n".join(sections)


async def store_interaction(
    user_id: str,
    user_message: str,
    agent_response: str,
    session_id: str,
):
    """Store the interaction as episodic memory after responding."""
    # Only store substantive exchanges (skip "thanks", "ok", etc.)
    if len(user_message) < 20:
        return
    summary = (
        f"User said: {user_message[:200]}\n"
        f"Agent replied: {agent_response[:300]}"
    )
    kx.remember(
        content=summary,
        agent_id=user_id,
        memory_type="episodic",
        session_id=session_id,
        ttl_days=90,
    )
```
main.py — the /chat route. It handles context injection, LLM generation, and memory storage — all in one request cycle.

```python
import asyncio
import os

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import AsyncOpenAI

from context import build_context, store_interaction

app = FastAPI()
oai = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

SYSTEM_BASE = """You are a helpful customer support agent for Kronvex.
You have access to product documentation and the user's conversation history.
Be concise, direct, and personal. If you have past context about this user,
use it. Do NOT repeat information the user already knows."""


class ChatRequest(BaseModel):
    user_id: str
    message: str
    session_id: str = "default"
    conversation: list[dict] = []  # last N turns


class ChatResponse(BaseModel):
    reply: str
    memories_used: int = 0
    rag_used: bool = False


@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
    # 1. Build context (RAG + memory) — non-blocking
    context_block = await build_context(req.user_id, req.message)

    # 2. Compose system prompt
    system = SYSTEM_BASE
    if context_block:
        system += f"\n\n{context_block}"

    # 3. Build messages array
    messages = [{"role": "system", "content": system}]
    messages.extend(req.conversation[-6:])  # last 3 turns
    messages.append({"role": "user", "content": req.message})

    # 4. Call LLM
    completion = await oai.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=600,
        temperature=0.4,
    )
    reply = completion.choices[0].message.content

    # 5. Store interaction async — don't block the response
    asyncio.create_task(
        store_interaction(req.user_id, req.message, reply, req.session_id)
    )

    return ChatResponse(
        reply=reply,
        memories_used=1 if context_block else 0,
        rag_used="PRODUCT DOCUMENTATION" in context_block,
    )


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```
When a user signs up, seed their profile as pinned semantic memories:

```python
def store_user_profile(user_id: str, user_data: dict):
    """Call this on signup / first login."""
    facts = [
        (f"User plan: {user_data['plan']}", "semantic"),
        (f"User name: {user_data['name']}", "semantic"),
        (f"Joined: {user_data['created_at']}", "semantic"),
        (f"Company: {user_data.get('company', '—')}", "semantic"),
    ]
    for content, mtype in facts:
        kx.remember(
            content=content,
            agent_id=user_id,
            memory_type=mtype,
            pinned=True,  # pinned = never expires, always recalled
        )
```
Use `pinned=True` for critical facts — plan, name, company. These bypass TTL and are always returned first in any recall. Reserve pinned for facts that are always relevant, not just frequently mentioned.

Run the server:

```shell
uvicorn main:app --reload
```
```shell
# First message — agent has no memory yet
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user_alice",
    "session_id": "sess_001",
    "message": "I keep hitting rate limits on the /recall endpoint"
  }'

# New session — agent remembers from last time
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user_alice",
    "session_id": "sess_002",
    "message": "Still having that issue"
  }'
# → Agent responds: "Still hitting the /recall rate limits?
#   Alice, you're on the Starter plan (1k req/day limit)..."
```
Production checklist
Before you ship this to real users, a few things to add:
- Auth middleware — verify that `user_id` in the request matches the authenticated user. Never let clients set their own `user_id` freely.
- Rate limit the /chat endpoint — one user shouldn't be able to burn your entire Kronvex quota. 20 req/min per user is a sensible default.
- Memory deduplication — if a user says "I'm on the Pro plan" in 10 conversations, you don't want 10 identical semantic memories. Check before storing, or use `pinned=True` for facts that shouldn't duplicate.
- TTL strategy — episodic memories (conversations) expire in 90 days by default. Semantic facts (plan, name) should be pinned or given long TTLs. Procedural rules (agent behavior for this user) should be pinned.
- Context budget — log `memories_used` and `rag_used` per request. If you're consistently hitting context limits, reduce `top_k` on either RAG or Kronvex.
- Error handling — both `build_context()` and `store_interaction()` are wrapped in try/except. A memory failure should never break a chat response.
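For the deduplication item, one lightweight client-side approach is to fingerprint each fact before calling `remember()` and skip repeats. A sketch (the `MemoryDeduper` helper is illustrative, not part of the Kronvex SDK; it only catches repeats that differ in case or whitespace, not genuine rephrasings):

```python
import hashlib


def memory_fingerprint(content: str) -> str:
    """Normalize a fact (case, whitespace) and hash it."""
    normalized = " ".join(content.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()


class MemoryDeduper:
    """Tracks fingerprints of stored facts; skip a store when already seen."""

    def __init__(self):
        self._seen: set[str] = set()

    def should_store(self, content: str) -> bool:
        fp = memory_fingerprint(content)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True
```

In practice you'd persist the fingerprints per user (e.g. in Redis or your DB) and guard the `kx.remember()` call in `store_interaction()` with `should_store()`.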
What the memory looks like after a few conversations
Here's what Kronvex stores for a user after 3 support sessions:
```json
[
  {
    "content": "User plan: Pro",
    "memory_type": "semantic",
    "pinned": true,
    "access_count": 14
  },
  {
    "content": "Session sess_001: User hit rate limits on /recall. On Starter plan at the time.",
    "memory_type": "episodic",
    "session_id": "sess_001",
    "ttl_days": 90,
    "access_count": 3
  },
  {
    "content": "Session sess_002: User upgraded to Pro. Rate limit issue resolved.",
    "memory_type": "episodic",
    "session_id": "sess_002",
    "ttl_days": 90,
    "access_count": 1
  },
  {
    "content": "User prefers bullet-point answers over long paragraphs.",
    "memory_type": "procedural",
    "pinned": true,
    "access_count": 8
  }
]
```
When Alice opens a new conversation, inject_context() retrieves the most relevant of these (by semantic similarity to her first message) and injects them as a system prompt block — automatically, before the LLM ever sees her message.
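To build intuition for what that injected block looks like, here's a local sketch that ranks memories the way described above (pinned first, then by access count) and renders them as a prompt section. This is an illustration of the ordering, not the actual `inject_context()` implementation:

```python
def format_context_block(memories: list[dict], max_items: int = 5) -> str:
    """Rank memories pinned-first, then by access_count, and render a block."""
    ranked = sorted(
        memories,
        key=lambda m: (not m.get("pinned", False), -m.get("access_count", 0)),
    )
    lines = [f"- ({m['memory_type']}) {m['content']}" for m in ranked[:max_items]]
    return "[USER MEMORY]\n" + "\n".join(lines)
```

Fed the four memories above, this puts "User plan: Pro" and the bullet-point preference ahead of the episodic session summaries.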
Next steps
- Add a session explorer — surface a "conversation history" panel to users in your UI, powered by `GET /agents/{id}/sessions`
- Memory-driven personalization — if a user's procedural memory says "prefers Python", always default code examples to Python without asking
- Escalation memory — if the agent can't resolve an issue, store it as pinned semantic memory so a human agent picks it up with full context
- Webhooks — coming to Kronvex soon: fire a webhook on every `remember()` call to sync with your CRM in real time
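The memory-driven personalization idea can be as simple as scanning procedural memories for a stated preference before formatting a reply. A hypothetical helper (not an SDK feature), using the answer-format preference from the example memories earlier:

```python
def preferred_format(memories: list[dict], default: str = "paragraphs") -> str:
    """Scan procedural memories for a stated answer-format preference."""
    for m in memories:
        if m.get("memory_type") != "procedural":
            continue
        content = m.get("content", "").lower()
        if "bullet" in content:
            return "bullets"
        if "paragraph" in content:
            return "paragraphs"
    return default
```

The returned value can then steer the system prompt ("Answer in bullet points") without ever asking the user again.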
Ready to build?
Get your free API key — 100 memories, 1 agent, no credit card. Under 5 minutes to first memory stored.
Get free API key →