The problem: 128k tokens sounds like a lot until your agent has memory

GPT-4o has a 128,000 token context window. Claude 3 handles 200,000. Gemini Pro goes even further. Developers often assume that with context windows this large, "just dump everything in" is a reasonable strategy. It isn't — and the math makes clear why.

Consider a real-world B2B AI agent scenario: a long-time user has accumulated roughly 68,000 tokens of conversation history, and the naive memory strategy injects all of it, plus a ~500-token system prompt and query, on every single call.

You've burned more than half of a 128k context window, paid for 68,500 input tokens (roughly $0.17 per call at GPT-4o pricing), and the LLM now has to find the relevant signal in 68,000 tokens of noise. And this is the good scenario — one where the history fits at all. Agents that run hundreds of sessions per user will overflow the context window entirely, forcing truncation of exactly the history you needed.

Cost at scale: At 10,000 daily active users, each generating 10 agent calls per day with ~68.5k tokens of injected history per call, you're looking at ~$17,000/day in input token costs alone — before any output. Semantic compression typically reduces this by 90%+.

The naive approach: dump everything into context — why it fails

The "full history injection" pattern fails on four dimensions:

1. Cost grows unboundedly

Every message ever sent by the user gets injected on every call. As users accumulate history, token cost per call grows linearly. There's no ceiling until you hit the context limit — at which point it fails hard.

2. Retrieval quality degrades with noise

LLMs exhibit "lost in the middle" behavior: facts buried in the middle of a long context are systematically underweighted compared to the beginning and end. Injecting 50k tokens of history to answer a question that only requires 3 specific facts actually makes the LLM worse at finding those facts.

3. Latency increases with token count

Time-to-first-token scales with input length. A 50k-token context takes roughly 3-4x longer to process than a 5k-token context. For agents where each tool call waits for an LLM response, this compounds across the full execution tree.

4. Truncation is unpredictable

When history overflows the context limit, different frameworks truncate differently: some drop oldest messages, some drop in the middle, some raise errors. All approaches lose information silently unless you've implemented explicit tracking of what was dropped.
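If you do have to truncate, make it explicit and observable. A minimal sketch of oldest-first truncation that returns what was dropped (the chars-per-token estimate is a rough heuristic):

```python
def truncate_oldest_first(messages: list[dict], max_tokens: int) -> tuple[list[dict], list[dict]]:
    """Keep the most recent messages that fit the budget; return (kept, dropped)."""
    kept: list[dict] = []
    total = 0
    for msg in reversed(messages):
        cost = len(msg["content"]) // 4  # ~4 chars per token, rough heuristic
        if total + cost > max_tokens:
            break
        kept.insert(0, msg)
        total += cost
    dropped = messages[: len(messages) - len(kept)]  # kept is always a suffix
    return kept, dropped
```

Logging `dropped` (or at least its length) turns silent information loss into something you can alert on.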

Semantic compression: store facts, not conversations

The core insight of semantic compression: conversations are a means to an end, not the end itself. What matters is the facts, preferences, and context that emerge from conversations — not the raw dialogue.

Consider this exchange:

Example conversation (raw)
User: Hey, so I've been thinking about our infrastructure.
      We've been on AWS for a while but honestly the costs are
      getting out of hand. We're probably going to move to
      self-hosted Kubernetes on Hetzner, should be done by Q3.

Agent: That's a significant migration. What's your current stack?

User: Mostly Python microservices, some FastAPI stuff. 10 engineers,
      we're a 50-person company overall. EU-based, compliance is
      important for us.

Agent: Makes sense. Hetzner is solid for EU workloads...

That exchange is ~150 tokens. The extractable facts worth storing are:

Extracted facts (compressed)
- Migrating from AWS to self-hosted Kubernetes on Hetzner, planned Q3
- Stack: Python microservices, FastAPI
- Team: 10 engineers, 50-person EU company
- EU compliance is a priority

That's 35 tokens — a 4x compression ratio. The semantic content is preserved. The conversational noise is eliminated. And critically, on the next query about "infrastructure recommendations", you retrieve just these 4 facts rather than the entire original conversation.

inject-context: one API call returns the N most relevant memories

Kronvex's inject_context() endpoint does the heavy lifting of semantic retrieval and formatting in a single call. You pass the current query; it returns a ready-to-inject context block containing the stored memories scored as most relevant to that query.

The result: instead of injecting 50,000 tokens of raw history, you inject 500–2,000 tokens of the most relevant facts. Token usage drops by 95%+. LLM quality improves because the signal-to-noise ratio increases. Latency drops because the context is shorter.

Python — inject-context
from kronvex import Kronvex

kv = Kronvex(api_key="kv-your-key")
agent = kv.agent("user_42")

# One call, returns formatted context block
context = agent.inject_context(
    query="What database should we use for this new service?",
    top_k=8  # Return up to 8 most relevant memories
)

# context is a pre-formatted string, ready to paste into system prompt:
# "- Migrating from AWS to Kubernetes on Hetzner by Q3
#  - Stack: Python microservices, FastAPI
#  - EU compliance is a priority
#  - Prefers open-source tools over vendor lock-in
#  - Has used PostgreSQL on previous projects"

print(context)
print(f"Token count: ~{len(context.split()) * 1.3:.0f} tokens")

Tiered memory: working memory (context window) vs long-term memory (Kronvex)

The right mental model for agent memory is a two-tier system, analogous to RAM and disk: the context window is working memory (fast, small, wiped between sessions), while Kronvex is long-term memory (durable, effectively unbounded, retrieved on demand).

The workflow for every agent turn:

  1. Start of turn: Call inject_context(current_query) → get long-term memory relevant to this query
  2. Build prompt: Combine injected memory + current conversation (working memory) + system instructions
  3. LLM call: Execute with the combined context
  4. End of turn: Extract facts worth promoting from working memory to long-term memory via remember()

The promotion decision is critical. Not everything in working memory deserves promotion to long-term memory. Ephemeral facts (current task state, one-time requests) should stay in working memory only. Durable facts (preferences, plans, biographical context) should be promoted. A lightweight LLM call after each turn can make this decision automatically.
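Before paying for even a lightweight extraction call, a crude keyword pre-filter can screen out turns that are obviously ephemeral. The marker lists here are illustrative, not exhaustive — tune them for your domain:

```python
# Coarse pre-filter, not a replacement for the LLM extraction step.
DURABLE_MARKERS = ("prefer", "always", "never", "we use", "our team",
                   "planning to", "migrating", "compliance")
EPHEMERAL_MARKERS = ("right now", "just this once", "for today", "this draft")

def worth_extracting(user_message: str) -> bool:
    """Return True if the turn plausibly contains a durable fact."""
    text = user_message.lower()
    if any(marker in text for marker in EPHEMERAL_MARKERS):
        return False
    return any(marker in text for marker in DURABLE_MARKERS)
```

Only when worth_extracting() returns True does the turn proceed to the LLM extraction step, which cuts the extraction-call volume further.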

Practical rules: what to store, what to discard, TTL strategy

What to store

- Durable preferences (tooling choices, vendors, communication style)
- Plans and commitments (migrations, deadlines, roadmap items)
- Biographical and organizational context (team size, stack, region, compliance requirements)

What to discard

- Ephemeral task state (the current draft, the step the agent is on)
- One-time requests and conversational pleasantries
- Anything trivially reconstructible from the current conversation

TTL strategy

Not all memories should live forever. A well-designed TTL policy stores identity and stable preferences with no expiry, gives time-bound facts like plans and project milestones a finite TTL (e.g. 90 days), and lets anything shorter-lived expire quickly instead of accumulating as stale context.
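Independent of any particular backend, the expiry mechanics can be sketched as a created-at timestamp plus an optional TTL. The record shape here is illustrative, not Kronvex's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class MemoryRecord:
    fact: str
    created_at: datetime
    ttl_days: Optional[int] = None  # None = permanent

    def is_expired(self, now: Optional[datetime] = None) -> bool:
        if self.ttl_days is None:
            return False
        now = now or datetime.now(timezone.utc)
        return now >= self.created_at + timedelta(days=self.ttl_days)

def live_memories(records: list[MemoryRecord]) -> list[MemoryRecord]:
    """Filter out expired records at retrieval time."""
    return [r for r in records if not r.is_expired()]
```

Filtering at retrieval time, rather than eagerly deleting, keeps the expiry logic simple and auditable; a periodic cleanup job can garbage-collect expired rows later.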

Code examples

Python — Full context-aware agent turn
from openai import OpenAI
from kronvex import Kronvex
import json

openai_client = OpenAI(api_key="sk-your-key")
kv = Kronvex(api_key="kv-your-key")

# Literal braces in the JSON example are doubled so str.format() leaves them intact.
EXTRACT_PROMPT = """You are extracting durable facts from a conversation turn.
Return JSON: {{"store": ["fact1", "fact2"], "ttl_days": {{"fact1": null, "fact2": 90}}}}
Use null TTL for permanent facts. Only extract user-specific, durable information.
Return {{"store": [], "ttl_days": {{}}}} if nothing is worth storing.

User: {user_msg}
Assistant: {assistant_msg}"""

def agent_turn(user_id: str, user_message: str, conversation_history: list) -> str:
    agent = kv.agent(user_id)

    # TIER 1: Retrieve long-term memory (semantic, top 8)
    long_term_context = agent.inject_context(user_message, top_k=8)

    # Build prompt: long-term memory + working memory (conversation) + instructions
    system_prompt = f"""You are a helpful B2B AI assistant.

LONG-TERM MEMORY (user context from past sessions):
{long_term_context or "No prior context."}

Use the above to personalize your response. Do not mention you have memory
unless directly asked."""

    messages = [{"role": "system", "content": system_prompt}]

    # TIER 2: Working memory = current conversation (last 20 turns max)
    messages.extend(conversation_history[-20:])
    messages.append({"role": "user", "content": user_message})

    # LLM call
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=1500
    )
    assistant_reply = response.choices[0].message.content

    # Promote durable facts to long-term memory (async in production)
    _extract_and_promote(agent, user_message, assistant_reply)

    return assistant_reply


def _extract_and_promote(agent, user_msg: str, assistant_msg: str):
    """Extract facts from turn and store to Kronvex with appropriate TTL."""
    result = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": EXTRACT_PROMPT.format(
            user_msg=user_msg, assistant_msg=assistant_msg
        )}],
        response_format={"type": "json_object"},
        max_tokens=500
    )
    try:
        data = json.loads(result.choices[0].message.content)
        for fact in data.get("store", []):
            ttl = data.get("ttl_days", {}).get(fact)
            # Pass ttl_days to remember() if your plan supports it
            agent.remember(fact, ttl_days=ttl)
    except Exception:
        # Extraction is best-effort; never fail the user-facing turn on a parse error
        pass

Python — Token budget monitor
import tiktoken
from functools import lru_cache

@lru_cache(maxsize=None)
def _get_encoding(model: str):
    # encoding_for_model is relatively expensive; cache one encoding per model
    return tiktoken.encoding_for_model(model)

def estimate_tokens(text: str, model: str = "gpt-4o") -> int:
    return len(_get_encoding(model).encode(text))

def build_context_within_budget(
    system_prompt: str,
    long_term_context: str,
    conversation_history: list,
    user_message: str,
    max_input_tokens: int = 50_000,
    response_budget: int = 4_000
) -> tuple[list, dict]:
    """Build messages list that fits within token budget."""
    available = max_input_tokens - response_budget

    # Fixed components
    system_tokens = estimate_tokens(system_prompt)
    context_tokens = estimate_tokens(long_term_context)
    user_tokens = estimate_tokens(user_message)
    overhead = system_tokens + context_tokens + user_tokens + 200  # safety margin

    # Working memory: fit as many recent turns as possible
    history_budget = available - overhead
    history_messages = []
    history_tokens = 0

    for msg in reversed(conversation_history):
        msg_tokens = estimate_tokens(msg["content"])
        if history_tokens + msg_tokens > history_budget:
            break
        history_messages.insert(0, msg)
        history_tokens += msg_tokens

    full_system = f"{system_prompt}\n\nLONG-TERM MEMORY:\n{long_term_context}"
    messages = [{"role": "system", "content": full_system}]
    messages.extend(history_messages)
    messages.append({"role": "user", "content": user_message})

    stats = {
        "total_input_tokens": overhead + history_tokens,
        "history_turns_included": len(history_messages),
        "history_turns_dropped": len(conversation_history) - len(history_messages),
        "long_term_memories_tokens": context_tokens
    }

    return messages, stats