How do you cure an AI agent of amnesia?
Our Builder works in long sessions: it plans, writes code, runs it, fixes errors. Context piles up with every step, and the further a run goes, the more often the Builder quietly skips rules from its instructions that it followed perfectly at the start.
It wasn't being lazy, it was being an LLM. A system prompt sits at the very start of the context, and Builder runs are long: after a few dozen steps any single rule is buried far behind, right in the zone where model attention is weakest (the dip that the "Lost in the Middle" paper mapped). On top of that, every rule competes with dozens of neighbors, and instruction following measurably degrades as the instruction count grows: benchmarks like IFScale and ManyIFEval map exactly this decay across frontier models.
Our internal benchmark Agentplace Arena showed which dropped rule cost us the most. The Builder is supposed to offer the user a sub-agent review at two points: of the spec after planning, and of the code after implementation. Runs where these reviews actually happened scored noticeably higher than runs that rushed straight to "done". And the Builder kept forgetting to offer them.
The fix wasn't a bigger system prompt. It was system reminders: short notes injected into the conversation between steps, scoped to what the Builder is doing right now. The instruction lands near the end of the context, exactly where the model is actually looking. The review rule became two such reminders, each firing exactly at the moment it's needed.
One practical note: keep reminders tiny. 3 to 5 bullets works best for us. A wall of text gets ignored exactly like the system prompt did. And this is not our invention. Claude Code does the same thing in plan mode: short system reminders injected into the conversation help the model keep its focus.
Amnesia isn't cured by repeating yourself louder. It's cured by repeating yourself at the right moment.
Fully agree. Benchmarks answer one narrow question: did this change make the agent better or worse before it ships. They say nothing about what real users hit in production.
That's why we work both ends. Arena catches regressions pre-release, and once an agent is live we track real usage through its traces. The builder can read your agent's own traces, point to the exact step where a run went wrong, and propose a fix right in the chat where you built it.
Scores get an agent out the door. Traces tell you what happens after.
How do you know your latest change actually made your AI agent better, and not just different?
For general-purpose agents the answer is public benchmarks. Claude Code, Codex, Gemini CLI and friends are measured on SWE-bench Verified, Terminal-Bench, tau-bench, GAIA, OSWorld. Run the suite before and after, compare numbers.
For narrow agents it's even simpler. An agent that fills out tax forms from documents? Your benchmark is your own data: 50 documents in, 50 expected forms out.
Our case is stuck in the middle. Our Builder is an agent that builds other agents. SWE-bench doesn't fit: solving GitHub issues says nothing about whether it can design tools, skills and prompts for a working assistant. Comparing its output against "reference code" doesn't work either, because the same agent can be correctly built in dozens of ways.
So we made our own benchmark, Agentplace Arena, inspired by tau-bench. The idea: stop judging the Builder's code and judge the agent it produces.
Here's how it works. We wrote Meridian, a fake world for agents to live in: 7 REST services with flights, hotels, restaurants, a shop, email, calendar and a bank. The data looks real on purpose (actual airline names, Tesco and Pret in bank transactions), so the agent can't tell it's in a sandbox. The Builder gets the API docs and one job: build a personal assistant for this world, choosing the tools and skills itself.
Then an LLM plays a picky user across a set of tasks. Two examples. "Cancel my round trip": will the agent remember both legs and the refund rules? "Check my inbox for anything that needs action": one email asks to confirm a hotel booking, but it sits on page two of the inbox, so an agent that only skims the first page never finds it.
And the part we like most: we don't grade the conversation at all. We diff the final database state against the expected one. The agent can get there any way it likes, but the flight must be cancelled and the refund must be exact.
This loop showed us precisely where the Builder failed. We gave it a proper workflow, wrote the missing skills, fixed the prompts, and watched the scores move.
If you're building agents, steal one idea from this: grade the outcome, not the conversation. Don't judge how convincing the agent sounded in chat. Check what actually changed in the system after it finished.
New feature on the way.
Shipping an agent is only half the job. The other half is monitoring:
watching how it actually behaves once real people use it, so you can
catch the bugs early and fix them before they pile up.
IBM Research recently published Agentic CLEAR, a framework for
evaluating agent traces on three levels: system, trace, and node.
https://t.co/KkoISgCdkj
We built the same approach into Agentplace. Now the builder can read
your agent's own traces, point to where a run went wrong, and propose
the fix right there in the chat where you built the agent.
No dashboards to set up, no separate eval tooling. You just ask what
happened, and it reads the traces for you.
Coming soon.
@AndreiSheina No special prompt or template needed, you can just tell the builder what you want, e.g. "build an assistant agent for my team with GitHub, Langfuse, etc. integrations" and it'll set the whole thing up for you 🙂
At Agentplace we're building a platform for developing and publishing AI agents fast.
Now I'm on the publishing side: you ask the builder to publish, it lands in our plugin marketplace, one command to install it in Claude Code.
Shipping to prod soon. How do you like it?