The problem: 128k tokens sounds like a lot until your agent has memory
GPT-4o has a 128,000 token context window. Claude 3 handles 200,000. Gemini 1.5 Pro goes even further, to a million or more. Developers often assume that with context windows this large, "just dump everything in" is a reasonable strategy. It isn't, and the math makes clear why.
Consider a real-world B2B AI agent scenario:
- System prompt: 500 tokens (instructions, persona, tools)
- User history dump: 50,000 tokens (100 past conversations, ~500 tokens each)
- Current conversation: 10,000 tokens (20 turns of active dialogue)
- RAG results: 8,000 tokens (top 5 documents)
- Response budget: ~4,000 tokens remaining
You've committed 68,500 input tokens, over half of a 128k window, paid roughly $0.17 per call at GPT-4o pricing ($2.50 per million input tokens), and the LLM now has to find the relevant signal in 68,000 tokens of noise. And this is the good scenario: one where the history fits at all. Agents that accumulate hundreds of sessions per user will overflow the context window entirely, forcing truncation of exactly the history you needed.
Cost at scale: At 10,000 daily active users, each making 10 agent calls per day with the full ~68,500-token prompt above, you're looking at roughly $17,000/day in input token costs alone, before any output tokens. Semantic compression typically cuts this by 90% or more.
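The arithmetic is worth checking. A quick sanity check in Python, assuming GPT-4o input pricing of $2.50 per million tokens (the rate implied by the $0.17 figure above):

```python
# Sanity-check the cost math above.
# Assumed input price: $2.50 per 1M tokens (GPT-4o, as implied above).
PRICE_PER_TOKEN = 2.50 / 1_000_000

input_tokens = 500 + 50_000 + 10_000 + 8_000   # system + history + conversation + RAG
cost_per_call = input_tokens * PRICE_PER_TOKEN

daily_users, calls_per_user = 10_000, 10
daily_cost = daily_users * calls_per_user * cost_per_call

print(f"{input_tokens:,} input tokens -> ${cost_per_call:.2f} per call")  # $0.17
print(f"Daily input cost: ${daily_cost:,.0f}")                            # $17,125
```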
The naive approach: dump everything into context — why it fails
The "full history injection" pattern fails on four dimensions:
1. Cost grows unboundedly
Every message ever sent by the user gets injected on every call. As users accumulate history, token cost per call grows linearly. There's no ceiling until you hit the context limit — at which point it fails hard.
2. Retrieval quality degrades with noise
LLMs exhibit "lost in the middle" behavior: facts buried in the middle of a long context are systematically underweighted compared to the beginning and end. Injecting 50k tokens of history to answer a question that only requires 3 specific facts actually makes the LLM worse at finding those facts.
3. Latency increases with token count
Time-to-first-token scales with input length. A 50k-token context takes roughly 3-4x longer to process than a 5k-token context. For agents where each tool call waits for an LLM response, this compounds across the full execution tree.
4. Truncation is unpredictable
When history overflows the context limit, different frameworks truncate differently: some drop oldest messages, some drop in the middle, some raise errors. All approaches lose information silently unless you've implemented explicit tracking of what was dropped.
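If you must truncate, at least make the policy explicit and observable. A minimal sketch of a drop-oldest strategy that reports what it dropped; the 4-characters-per-token estimate is a rough assumption:

```python
def truncate_history(messages: list[dict], max_tokens: int) -> tuple[list[dict], list[dict]]:
    """Keep the most recent messages that fit the budget; return (kept, dropped).

    Token counts are estimated at roughly 4 characters per token.
    """
    kept, used = [], 0
    for msg in reversed(messages):           # walk newest-first
        cost = len(msg["content"]) // 4 + 1
        if used + cost > max_tokens:
            break                            # contiguous recent window only
        kept.insert(0, msg)                  # restore chronological order
        used += cost
    dropped = messages[: len(messages) - len(kept)]
    return kept, dropped
```

Logging the `dropped` list turns silent information loss into something you can monitor and alert on.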
Semantic compression: store facts, not conversations
The core insight of semantic compression: conversations are a means to an end, not the end itself. What matters is the facts, preferences, and context that emerge from conversations — not the raw dialogue.
Consider this exchange:
User: Hey, so I've been thinking about our infrastructure.
We've been on AWS for a while but honestly the costs are
getting out of hand. We're probably going to move to
self-hosted Kubernetes on Hetzner, should be done by Q3.
Agent: That's a significant migration. What's your current stack?
User: Mostly Python microservices, some FastAPI stuff. 10 engineers,
we're a 50-person company overall. EU-based, compliance is
important for us.
Agent: Makes sense. Hetzner is solid for EU workloads...
That exchange is ~150 tokens. The extractable facts worth storing are:
- Migrating from AWS to self-hosted Kubernetes on Hetzner, planned Q3
- Stack: Python microservices, FastAPI
- Team: 10 engineers, 50-person EU company
- EU compliance is a priority
That's 35 tokens — a 4x compression ratio. The semantic content is preserved. The conversational noise is eliminated. And critically, on the next query about "infrastructure recommendations", you retrieve just these 4 facts rather than the entire original conversation.
inject-context: one API call returns the N most relevant memories
Kronvex's inject_context() endpoint does the heavy lifting of semantic retrieval and formatting in a single call. You pass the current query; it returns a ready-to-inject context block containing the most relevant stored memories, scored by:
- Semantic similarity (60% weight) — cosine similarity between the query embedding and stored memory embeddings via pgvector
- Recency (20% weight) — sigmoid function with 30-day inflection point; recent memories score higher
- Frequency (20% weight) — log-scaled access count; memories recalled often are presumed important
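Kronvex doesn't expose its exact formulas, but the 60/20/20 blend described above can be sketched as follows. The sigmoid steepness (0.1) and the frequency normalization cap (100 accesses) are illustrative assumptions, not documented values:

```python
import math

def memory_score(cosine_sim: float, age_days: float, access_count: int) -> float:
    """Blend semantic similarity, recency, and frequency with 60/20/20 weights."""
    # Recency: sigmoid whose inflection point sits at 30 days of age.
    recency = 1 / (1 + math.exp(0.1 * (age_days - 30)))
    # Frequency: log-scaled access count, capped at 1.0 (cap is an assumption).
    frequency = min(math.log1p(access_count) / math.log1p(100), 1.0)
    return 0.6 * cosine_sim + 0.2 * recency + 0.2 * frequency
```

A memory touched yesterday sits near the top of the recency curve; one untouched for a year contributes almost nothing from that term.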
The result: instead of injecting 50,000 tokens of raw history, you inject 500–2,000 tokens of the most relevant facts. Token usage drops by 95%+. LLM quality improves because the signal-to-noise ratio increases. Latency drops because the context is shorter.
from kronvex import Kronvex

kv = Kronvex(api_key="kv-your-key")
agent = kv.agent("user_42")

# One call, returns formatted context block
context = agent.inject_context(
    query="What database should we use for this new service?",
    top_k=8  # Return up to 8 most relevant memories
)

# context is a pre-formatted string, ready to paste into system prompt:
# "- Migrating from AWS to Kubernetes on Hetzner by Q3
#  - Stack: Python microservices, FastAPI
#  - EU compliance is a priority
#  - Prefers open-source tools over vendor lock-in
#  - Has used PostgreSQL on previous projects"

print(context)
print(f"Token count: ~{len(context.split()) * 1.3:.0f} tokens")
Tiered memory: working memory (context window) vs long-term memory (Kronvex)
The right mental model for agent memory is a two-tier system, analogous to RAM and disk:
- Working memory (context window) — the current conversation, active tool outputs, immediate task context. Fast, but limited and ephemeral. Cleared at the end of each session.
- Long-term memory (Kronvex) — distilled facts about the user, their preferences, context, and history. Unlimited in practice, persistent across sessions, retrieved semantically.
The workflow for every agent turn:
- Start of turn: call inject_context(current_query) to get the long-term memory relevant to this query
- Build prompt: combine injected memory + current conversation (working memory) + system instructions
- LLM call: execute with the combined context
- End of turn: extract facts worth promoting from working memory to long-term memory via remember()
The promotion decision is critical. Not everything in working memory deserves promotion to long-term memory. Ephemeral facts (current task state, one-time requests) should stay in working memory only. Durable facts (preferences, plans, biographical context) should be promoted. A lightweight LLM call after each turn can make this decision automatically.
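Because the promotion check runs after every turn, it pays to put a cheap rule-based gate in front of even a small-model LLM call. A hypothetical pre-filter; the word-count threshold and greeting list are illustrative:

```python
GREETINGS = {"hi", "hello", "hey", "thanks", "thank you", "ok", "okay", "bye"}

def worth_extracting(user_msg: str) -> bool:
    """Skip the extraction LLM call for turns that can't contain durable facts."""
    text = user_msg.strip().lower().rstrip("!.?")
    if text in GREETINGS:
        return False   # pure pleasantries, nothing to store
    if len(text.split()) < 4:
        return False   # too short to carry a durable fact
    return True
```

At 10 calls per user per day, skipping even a third of extraction calls is a meaningful saving with essentially no recall cost.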
Practical rules: what to store, what to discard, TTL strategy
What to store
- User preferences ("prefers concise answers", "uses dark mode", "speaks French")
- Technical context ("their stack is FastAPI + PostgreSQL", "deployed on Hetzner")
- Business context ("50-person EU company", "Q3 migration deadline", "GDPR compliance required")
- Decisions made ("chose PostgreSQL over MongoDB", "decided against vendor X")
- Goals and ongoing projects ("building a CRM integration", "migrating to Kubernetes")
What to discard
- Greetings and pleasantries
- One-time task requests that were completed ("write me a Python script for X")
- Questions (store the answers, not the questions)
- General knowledge that isn't user-specific
- Temporary state ("currently thinking about X" — use short TTL instead)
TTL strategy
Not all memories should live forever. A well-designed TTL policy:
- Permanent (no TTL): Core preferences, identity facts ("speaks French", "EU company")
- Long-lived (1 year): Technical stack, team size, business context
- Medium (90 days): Active projects, ongoing goals
- Short (30 days): Tactical context ("currently evaluating vendor X")
- Ephemeral (7 days): Temporary states, speculative plans
Code examples
from openai import OpenAI
from kronvex import Kronvex
import json

openai_client = OpenAI(api_key="sk-your-key")
kv = Kronvex(api_key="kv-your-key")

# Literal JSON braces are doubled so str.format() doesn't treat them as fields.
EXTRACT_PROMPT = """You are extracting durable facts from a conversation turn.
Return JSON: {{"store": ["fact1", "fact2"], "ttl_days": {{"fact1": null, "fact2": 90}}}}
Use null TTL for permanent facts. Only extract user-specific, durable information.
Return {{"store": [], "ttl_days": {{}}}} if nothing is worth storing.

User: {user_msg}
Assistant: {assistant_msg}"""

def agent_turn(user_id: str, user_message: str, conversation_history: list) -> str:
    agent = kv.agent(user_id)

    # TIER 1: Retrieve long-term memory (semantic, top 8)
    long_term_context = agent.inject_context(user_message, top_k=8)

    # Build prompt: long-term memory + working memory (conversation) + instructions
    system_prompt = f"""You are a helpful B2B AI assistant.

LONG-TERM MEMORY (user context from past sessions):
{long_term_context or "No prior context."}

Use the above to personalize your response. Do not mention you have memory
unless directly asked."""

    messages = [{"role": "system", "content": system_prompt}]

    # TIER 2: Working memory = current conversation (last 20 turns max)
    messages.extend(conversation_history[-20:])
    messages.append({"role": "user", "content": user_message})

    # LLM call
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=1500
    )
    assistant_reply = response.choices[0].message.content

    # Promote durable facts to long-term memory (async in production)
    _extract_and_promote(agent, user_message, assistant_reply)

    return assistant_reply

def _extract_and_promote(agent, user_msg: str, assistant_msg: str):
    """Extract facts from turn and store to Kronvex with appropriate TTL."""
    result = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": EXTRACT_PROMPT.format(
            user_msg=user_msg, assistant_msg=assistant_msg
        )}],
        response_format={"type": "json_object"},
        max_tokens=500
    )
    try:
        data = json.loads(result.choices[0].message.content)
        for fact in data.get("store", []):
            ttl = data.get("ttl_days", {}).get(fact)
            # Pass ttl_days to remember() if your plan supports it
            agent.remember(fact, ttl_days=ttl)
    except Exception:
        pass  # extraction is best-effort; never fail the user-facing turn
import tiktoken

def estimate_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def build_context_within_budget(
    system_prompt: str,
    long_term_context: str,
    conversation_history: list,
    user_message: str,
    max_input_tokens: int = 50_000,
    response_budget: int = 4_000
) -> tuple[list, dict]:
    """Build messages list that fits within token budget."""
    available = max_input_tokens - response_budget

    # Fixed components
    system_tokens = estimate_tokens(system_prompt)
    context_tokens = estimate_tokens(long_term_context)
    user_tokens = estimate_tokens(user_message)
    overhead = system_tokens + context_tokens + user_tokens + 200  # safety margin

    # Working memory: fit as many recent turns as possible
    history_budget = available - overhead
    history_messages = []
    history_tokens = 0
    for msg in reversed(conversation_history):
        msg_tokens = estimate_tokens(msg["content"])
        if history_tokens + msg_tokens > history_budget:
            break
        history_messages.insert(0, msg)
        history_tokens += msg_tokens

    full_system = f"{system_prompt}\n\nLONG-TERM MEMORY:\n{long_term_context}"
    messages = [{"role": "system", "content": full_system}]
    messages.extend(history_messages)
    messages.append({"role": "user", "content": user_message})

    stats = {
        "total_input_tokens": overhead + history_tokens,
        "history_turns_included": len(history_messages),
        "history_turns_dropped": len(conversation_history) - len(history_messages),
        "long_term_memories_tokens": context_tokens
    }
    return messages, stats