Your AI Agent Is Leaking Intelligence (And Nobody’s Talking About It)

Notes from Juan Pablo Garcia Gonzalez’s session at AgentCon Boston 2026

Jun 13, 2026

There’s a dirty secret in AI engineering that nobody puts in the marketing copy.

That 128K context window your LLM provider is selling you? It’s not 128K of useful space. Once your system prompt, tool definitions, conversation history, and RAG chunks are loaded, you’re working with maybe 69K tokens of effective runway. And the model isn’t even paying equal attention to all of it.

Juan Pablo Garcia Gonzalez, AWS Solution Architect, spent an hour at AgentCon Boston 2026 making a room full of engineers deeply uncomfortable about context windows. By the end, he’d handed them a pattern that IBM used to cut 20 million tokens down to 1,000.

That’s not a typo. 20,000x reduction.

The 128K Lie

The first thing Juan Pablo put on the screen was a math problem.

128K — advertised context window
Minus 4K — system prompt
Minus 15K — tool definitions (MCP servers, function schemas)
Minus 20K — conversation history (a real all-day session)
Minus 20K — RAG chunks retrieved per turn
Equals 69K — Maximum Effective Context Window (MECW)

He calls this number the MECW, and there’s an arXiv paper (2509.21361) formalizing it. The headline: what vendors advertise and what engineers actually get are very different numbers.
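The subtraction above is simple enough to keep as a script next to your agent config. Here is a minimal sketch in Python. The token counts are the illustrative figures from the slide, not measurements, and the function name `effective_context_window` is made up for this example.

```python
# Illustrative figures from the talk; replace with counts measured on your own stack.
ADVERTISED_WINDOW = 128_000

overhead = {
    "system_prompt": 4_000,
    "tool_definitions": 15_000,
    "conversation_history": 20_000,
    "rag_chunks": 20_000,
}


def effective_context_window(advertised: int, overhead: dict[str, int]) -> int:
    """Return the tokens left once every fixed overhead item is subtracted."""
    return advertised - sum(overhead.values())


mecw = effective_context_window(ADVERTISED_WINDOW, overhead)
share = mecw / ADVERTISED_WINDOW
print(f"Effective window: {mecw:,} tokens ({share:.0%} of advertised)")
# prints: Effective window: 69,000 tokens (54% of advertised)
```

Swapping in real counts from your own tokenizer is the whole point; the slide numbers are only a starting template.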

And it gets worse. Even within those 69K tokens, the model doesn’t treat all positions equally. He showed the U-curve, the effect researchers call “Lost in the Middle.” Content at the very beginning and very end of the context is the content the model uses most reliably. Content in the middle? It fades. The model technically reads it, but its ability to recall and use it drops off sharply.

For production applications — customer service bots running 8-hour shifts, contract review tools processing multi-document deals, code agents maintaining session state across a full sprint — this isn’t a theoretical concern. It’s a daily failure mode.

The Three Horizons (And Why You’re Only Engineering One)

Most teams think about context as a single thing: what’s in the prompt right now.

Juan Pablo broke it into three distinct time horizons:

Horizon 1 — The Current Turn. What’s in the active context window. This is what everyone obsesses over.
Horizon 2 — The Session. Everything since the conversation started — tool call history, intermediate results, clarifications. This is what fills up and gets dropped as the session grows.
Horizon 3 — Across Sessions. Persistent knowledge — user preferences, past decisions, institutional memory. This is what almost nobody is engineering for.

The U-curve kills you in Horizon 1. Token budget collapse kills you in Horizon 2. And ignoring Horizon 3 means every new session starts cold, making users repeat themselves and agents re-derive decisions that were already made.

The Demo-to-Production Cliff

Here’s the trap. You build a beautiful demo. Two turns. The agent retrieves some context, generates a clean response, and the investors are impressed.

Then you ship to production.

Real users run 40-turn sessions. They upload documents, ask follow-up questions, change their minds, come back the next day. The context window fills up. The model starts losing the middle. Tool calls pile up. The agent starts spiraling — calling the same search function repeatedly, reasoning itself into loops with no exit condition.

The demo worked because it never got long enough to fail.

Juan Pablo called this the demo-to-production cliff. The cliff isn’t a bug. It’s a physics problem. And physics problems need architecture solutions, not prompt tweaks.
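One architectural answer to the "loops with no exit condition" failure is to give the agent loop an explicit budget. The sketch below is an assumption about how that might look, not something shown in the talk: `LoopGuard`, `max_steps`, and `max_repeats` are hypothetical names, and the limits are arbitrary.

```python
import json
from collections import Counter


class LoopGuard:
    """Stops an agent run after too many steps or too many identical tool calls."""

    def __init__(self, max_steps: int = 25, max_repeats: int = 3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def check(self, tool_name: str, arguments: dict) -> None:
        """Record one tool call; raise RuntimeError when a limit is exceeded."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError(f"step budget of {self.max_steps} exceeded")
        # Canonical JSON so that argument order cannot hide a repeated call.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            raise RuntimeError(f"{tool_name} repeated with identical arguments")


# Usage: call check() before every tool invocation in the agent loop.
guard = LoopGuard(max_steps=10, max_repeats=2)
guard.check("search", {"q": "refund policy"})
guard.check("search", {"q": "refund policy"})
try:
    guard.check("search", {"q": "refund policy"})  # third identical call
except RuntimeError as err:
    print("stopped:", err)
```

What the agent does after the guard trips (ask the user, fall back, give up) is a product decision; the point is only that the exit condition lives in code rather than in the prompt.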

MCP Sprawl: The Hidden Context Tax

One quick aside that landed hard in the room: MCP sprawl.

The Model Context Protocol — the tooling standard that lets agents call external services — has become the hot thing to add. Give your agent Slack access, calendar access, GitHub access, Jira access, Notion access...

Each connected MCP server doesn’t just add capability. It adds its full tool schema to the context window on every single call. A team running 12 MCP servers might be burning 15K-20K tokens on tool definitions before the model reads a single word of actual content.

The lesson isn’t “don’t use MCP.” It’s that tools have a context cost. Be deliberate. Only load what the agent actually needs for the task at hand.
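To make "only load what the agent needs" concrete, here is a rough sketch of per-task tool selection. Everything in it is hypothetical: the schemas are toy stand-ins for what real MCP servers publish, the task-to-tool mapping is invented, and the 4-characters-per-token estimate is a crude rule of thumb rather than a real tokenizer.

```python
import json

# Hypothetical, heavily simplified tool schemas.
TOOL_SCHEMAS = {
    "github_create_issue": {
        "description": "Create an issue in a GitHub repository",
        "parameters": {"repo": "string", "title": "string", "body": "string"},
    },
    "slack_post_message": {
        "description": "Post a message to a Slack channel",
        "parameters": {"channel": "string", "text": "string"},
    },
    "calendar_list_events": {
        "description": "List calendar events in a date range",
        "parameters": {"start": "string", "end": "string"},
    },
}

# Hypothetical mapping from task type to the tools that task actually needs.
TASK_TOOLS = {
    "triage_bug": ["github_create_issue", "slack_post_message"],
    "schedule_meeting": ["calendar_list_events"],
}


def estimate_tokens(obj) -> int:
    """Crude estimate: about 4 characters of JSON per token (an assumption)."""
    return len(json.dumps(obj)) // 4


def tools_for_task(task: str) -> dict:
    """Return only the schemas the given task needs; unknown tasks get none."""
    return {name: TOOL_SCHEMAS[name] for name in TASK_TOOLS.get(task, [])}


all_cost = estimate_tokens(TOOL_SCHEMAS)
task_cost = estimate_tokens(tools_for_task("schedule_meeting"))
print(f"all tools: ~{all_cost} tokens; schedule_meeting only: ~{task_cost} tokens")
```

With three toy schemas the savings are small. With a dozen real servers, the same filter is the difference between a tool budget you chose and one you inherited.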

The Memory Pointer Pattern

This is the part that made everyone in the room reach for their notebooks.

The naive solution to context overflow is summarization. Keep a rolling summary of what happened. When the context gets full, compress and continue.

The problem with summarization is that it’s lossy by definition. You’re trusting the LLM to decide what matters. In casual conversation, that’s fine. In legal contracts, incident response logs, financial records, or anywhere precision matters — it’s not.

A summarized contract clause might say “the IP provision protects company assets.” The actual clause says:

“All intellectual property developed using any Company resource, including personal devices connected to Company networks, shall be exclusively owned by the Company, without exception or compensation.”

The difference between those two descriptions could be worth millions of dollars in litigation. Summarization ate the nuance that matters.

The Memory Pointer Pattern solves this differently.

Instead of compressing content into the context window, you move the content out of the context entirely — into an external store (a vector database, S3, Azure AI Search, a document store). What stays in the context is just a pointer:

type: application_log
id: prod-2026-06-13-api-errors
location: s3://logs/2026/06/13/api.log
size_bytes: 204800000

That pointer is roughly 110 bytes, a few dozen tokens at most. The log it references is about 205 MB. At a rough 4 bytes per token, that is on the order of 50 million tokens, far more than any context window can hold.

When the agent needs a specific log line, it calls a tool with the pointer. The tool fetches exactly what’s needed. The agent never holds the full log; it only ever holds what it’s actively working with right now.

Full fidelity. Zero compression. Minimal context cost.
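Here is a minimal sketch of the pattern, under some loud assumptions: a local temp file stands in for the S3 object so the example runs anywhere, and the names `MemoryPointer` and `fetch_matching_lines` are invented for illustration rather than taken from the talk or from IBM’s work.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryPointer:
    type: str
    id: str
    location: str  # a local path in this sketch; an s3:// URI in the example above
    size_bytes: int


def fetch_matching_lines(pointer: MemoryPointer, needle: str, limit: int = 5) -> list[str]:
    """Tool the agent calls: stream the stored log, return only matching lines."""
    matches = []
    with open(pointer.location, encoding="utf-8") as f:
        for line in f:
            if needle in line:
                matches.append(line.rstrip("\n"))
                if len(matches) >= limit:
                    break
    return matches


# Build a small stand-in log so the sketch runs without any cloud access.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False, encoding="utf-8") as tmp:
    for i in range(10_000):
        status = 500 if i % 2_500 == 0 else 200
        tmp.write(f"request={i} status={status}\n")
    log_path = tmp.name

pointer = MemoryPointer(
    type="application_log",
    id="prod-2026-06-13-api-errors",
    location=log_path,
    size_bytes=os.path.getsize(log_path),
)

errors = fetch_matching_lines(pointer, "status=500")
print(pointer)  # this small record is all that sits in the prompt
print(errors)   # only the four matching lines come back from the tool call
os.remove(log_path)
```

In a real deployment the tool would more likely issue a ranged read or a search query against the external store instead of scanning a file, but the contract is the same: the prompt carries the pointer, and the tool returns only the slice the agent asked for.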

IBM’s 20,000x Number

Juan Pablo saved the IBM research for the end.

A team at IBM applied the Memory Pointer Pattern to an enterprise agentic workflow. Before: 20 million tokens used per session. After: 1,000 tokens.

Twenty million to one thousand. A 20,000x reduction in context consumption.

The workflow didn’t get less capable. In fact, it got more reliable — because the agents were no longer operating in degraded attention zones, and every tool call retrieved exact data rather than hoping a compressed summary had preserved the right detail.

What This Means for the Stack

If you’re building production agentic systems right now, here’s what Juan Pablo’s session translates to:

Don’t summarize, externalize. Move large content out of context. Keep pointers in context. Retrieve on demand.
Audit your MECW. Before you assume you have 128K to work with, subtract your system prompt, tool schemas, and expected conversation depth. What’s left is your real budget. Design for that number.
Engineer all three horizons. Current turn, session, and across-sessions are different problems requiring different solutions. Most teams are only thinking about the first one.
Be deliberate with MCP. Every tool schema has a token cost. Load only what the current task requires.
Test at production length. A 2-turn demo proves nothing about 40-turn production. Load test your context, not just your infrastructure.
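The last point can be checked on paper before you run a single load test. The sketch below simulates context growth turn by turn and reports the first turn that no longer fits. The fixed overhead uses the slide’s figures; the 3,000-tokens-per-turn growth rate and the function name `first_overflow_turn` are assumptions made up for this example.

```python
from typing import Optional


def first_overflow_turn(
    window: int, fixed_overhead: int, tokens_per_turn: int, max_turns: int
) -> Optional[int]:
    """Return the first turn whose accumulated history no longer fits, or None."""
    used = fixed_overhead
    for turn in range(1, max_turns + 1):
        used += tokens_per_turn
        if used > window:
            return turn
    return None


WINDOW = 128_000
FIXED = 4_000 + 15_000 + 20_000  # system prompt + tool schemas + RAG chunks (talk's figures)
PER_TURN = 3_000                 # hypothetical average history growth per turn

demo = first_overflow_turn(WINDOW, FIXED, PER_TURN, max_turns=2)
production = first_overflow_turn(WINDOW, FIXED, PER_TURN, max_turns=40)
print(demo)        # None: a 2-turn demo never gets close to the limit
print(production)  # 30: under these assumptions the 40-turn session overflows at turn 30
```

Replace `PER_TURN` with a number measured from real transcripts and the same loop tells you roughly where your own cliff sits.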

The Uncomfortable Truth

The 128K context window was sold to us as “effectively unlimited.” Juan Pablo’s talk was a thorough demolition of that idea.

But the architecture patterns that survive this constraint aren’t exotic or complex. They’re principled. Move big things out, keep pointers in, retrieve exactly what you need when you need it. Systems that do this well will outperform systems that rely on hoping the model’s attention is pointed at the right paragraph.

The engineers who understand this constraint deeply — and build around it — are the ones shipping agentic systems that actually work after lunch on day one.

AgentCon Boston, June 13, 2026. Juan Pablo Garcia Gonzalez, AWS Solution Architect.

MECW paper: arxiv.org/abs/2509.21361
