"And there's a man aboard that ship — position (2,1), crew mess, Deck 2 — that you don't know about yet, who has been there for nine hours already, who has a keycard to Bay 4, and who is running out of time in a different direction entirely."
"Use a better model" doesn't fix legal AI hallucinations. A model at temperature zero is effectively deterministic and still fabricates. The problem was never randomness.
It's that the model predicts plausible tokens with no notion of what's true. Which is why the fix is architectural.
https://t.co/6YnoQ8LaEb
If your LLM gets it right on the third retry, that's not the model learning. It's a silent error. The correct number of retries when structured output fails validation is one.
A second failure on the same input means the problem isn't the model. It's your validator, your prompt, or your schema.
More retries hide which one. They also run up your bill.
The cleanest version of this for agentic systems: keep your authoritative state out of the transcript entirely.
In my solo-TTRPG game master, recent turns stay primary and older turns compact into a secondary digest, but the character sheet and scene state never enter the conversation at all. They're structured state injected each turn: primary fidelity, secondary-source cost, and immune to compaction drift.
The tradeoff you're describing is real. For the state you most need to be correct, you can dodge it.
A context engineering metaphor I've been playing around with:
- Primary source: the source of truth. Raw data. Transcripts. Code.
- Secondary source: one step removed. Summaries. Compactions. Documentation.
For instance, compaction takes a primary source (the conversation history) and turns it into a secondary source (the summary). This is lossy, but means the secondary source can fit into a smaller space.
If you want to know what your codebase does, your code is a primary source. Your docs are a secondary source.
Loading primary sources into context is expensive, but provides richer context. Secondary sources are cheaper to load into context, but may be information-lossy.
Any context engineering will involve managing the tradeoffs between both.
The split I'd add before those four: some memory must be exact, some is fuzzy recall. Run the exact slice through a ranker and you corrupt your ground truth — not because the ranker failed, but because relevance is the wrong question for a fact with one right answer. RAG for the fuzzy half; exact injection for the rest.
@trq212 The reliability isn't the subagents; it's the deterministic harness holding state outside any one context window. Sorting example says it: the loop holds the bracket, only the running order stays in context. Same single-agent: drift stops when state lives outside the window.
@mugabuilds I aggressively prune the history and keep state in a separate backend database. State gets injected from the database on each turn, so there's nothing for the model to track.
@ryanx_ai@JustJerry121 This tracks with one refinement: anything reconstructable from message history will get reconstructed. I prune the history and inject authoritative state each turn, rather than trusting curated context as canonical.
Some thoughts after building the Data Agent Benchmark
- build a "semantic layer" for all your data. a simple first pass can be done by running an LLM over a sample of data from each table, to come up with column annotations (commonly known as semantic types), possible functional dependencies (i.e., columns that depend on each other)
- use the semantic layer in the prompt for all questions
- enterprise data often needs to be cleaned. rather than try to clean all data up front (which is really difficult), keep a memory of subsets of data that need to be cleaned and how, so that the relevant data can be cleaned by the agent at query time
- extend the harness (i.e., codex, claude code) with a tool like DocETL or Claude workflows; basically the ability to run agentic map-reduce. often questions require reasoning about unstructured text columns to come up with the answer, which SQL or code doesn't support
Just caught myself asking Claude Code about a data structure in my own codebase instead of digging it up, even though digging would've been faster. At some point, I stopped optimizing for wall time and started optimizing for activation energy.
@eugeneyan This generalizes past security: the moment an agent's output becomes state, it's a claim and needs to be validated against ground truth the model can't fabricate.
Sometimes that's cheap. Reproducing an exploit in a sandbox is hard — hence the bottleneck.
@JongWerk It depends on what the agent is doing. If you're storing state outside the context window, then you can compact & summarize aggressively.
@charlespacker I built an agentic app entirely on this premise — state in my own backend with the model behind a tool boundary. Swapping models is just a config change. The ownership argument shows up even at single-app scale, not just platform scale.
@yoheinakajima Thanks! Zoltar's a TTRPG game master built on Claude with a hard tool-call boundary. The model proposes actions, the backend validates and owns all state (dice, inventory, canon). Write-up of the core idea here: https://t.co/6UyOO4Zo0X. Repo: https://t.co/2rorDCk6Ov.
Zoltar uses something close to this. An append-only event log of validated tool calls is the canonical state, and the model only sees a rolling window plus lazy summaries. The model can't fabricate state because it never owns state. Glad to see this formalized.
babyagi has ~200 citations, but 0 papers... i just published my first paper on arXiv 😆
"The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems"
https://t.co/c7mbRggdCh
the case for agents that coordinate through persistent replayable state — no conversation loops, no workflows, no A2A — with auditability, forking, and causal lineage built in.
check it out and let me know what you think!