Alex Gates-Shannon (they/them)

@alexgsdev

Building agentic systems that don't hallucinate their own state | Zoltar: AI game moderator for solo TTRPGs | Automata Codex Studio

Ashburn, Virginia

Joined April 2026

78 Following

9 Followers

38 Posts

Pinned Tweet

Alex Gates-Shannon (they/them)

@alexgsdev

about 2 months ago

"And there's a man aboard that ship — position (2,1), crew mess, Deck 2 — that you don't know about yet, who has been there for nine hours already, who has a keycard to Bay 4, and who is running out of time in a different direction entirely."

Alex Gates-Shannon (they/them)

@alexgsdev

15 days ago

"Use a better model" doesn't fix legal AI hallucinations. A model at temperature zero is effectively deterministic and still fabricates. The problem was never randomness. It's that the model predicts plausible tokens with no notion of what's true. Which is why the fix is architectural. https://t.co/6YnoQ8LaEb

Alex Gates-Shannon (they/them)

@alexgsdev

25 days ago

Wrote up the full argument (including the one case where the rule doesn't apply) on the blog. https://t.co/Huc7YWKYrJ

Alex Gates-Shannon (they/them)

@alexgsdev

25 days ago

If your LLM gets it right on the third retry, that's not the model learning. It's a silent error. The correct number of retries when structured output fails validation is one.

Alex Gates-Shannon (they/them)

@alexgsdev

25 days ago

A second failure on the same input means the problem isn't the model. It's your validator, your prompt, or your schema. More retries hide which one. They also run up your bill.
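
A minimal Python sketch of the one-retry rule from this thread. The `validate` schema (a JSON object with an integer `damage` field), the `generate` callable, and `StructuredOutputError` are hypothetical names chosen for illustration, not taken from Zoltar or any library.

```python
import json
from typing import Callable


class StructuredOutputError(Exception):
    """Raised when output fails validation on the first attempt and on the single retry."""


def validate(raw: str) -> dict:
    """Hypothetical validator: expects a JSON object with an integer 'damage' field."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict) or not isinstance(data.get("damage"), int):
        raise ValueError("expected an object with an integer 'damage' field")
    return data


def call_with_one_retry(generate: Callable[[str], str], prompt: str) -> dict:
    """Call the model, validate, and retry at most once before surfacing the failure."""
    errors = []
    for attempt in range(2):  # first attempt plus exactly one retry
        try:
            return validate(generate(prompt))
        except ValueError as exc:
            errors.append(f"attempt {attempt + 1}: {exc}")
    # Two failures on the same input: stop and report both, so the validator,
    # prompt, or schema can be inspected instead of being hidden by more retries.
    raise StructuredOutputError("; ".join(errors))
```

Raising with both error messages attached keeps the second failure visible instead of burying it under further attempts.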

Alex Gates-Shannon (they/them)

@alexgsdev

28 days ago

The cleanest version of this for agentic systems: keep your authoritative state out of the transcript entirely. In my solo-TTRPG game master, recent turns stay primary and older turns compact into a secondary digest, but the character sheet and scene state never enter the conversation at all. They're structured state injected each turn: primary fidelity, secondary-source cost, and immune to compaction drift. The tradeoff you're describing is real. For the state you most need to be correct, you can dodge it.
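
One way the layout described above could look, as a rough Python sketch. The `Transcript` class, the `keep` window size, the section labels, and the truncation used as a stand-in summarizer are all assumptions for illustration; the post does not say how Zoltar implements compaction.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Transcript:
    recent: list[str] = field(default_factory=list)  # primary: verbatim recent turns
    digest: str = ""                                  # secondary: lossy summary of older turns
    keep: int = 4                                     # hypothetical window size

    def add(self, turn: str) -> None:
        self.recent.append(turn)
        while len(self.recent) > self.keep:
            oldest = self.recent.pop(0)
            # Stand-in for a real summarizer: keep the first 40 characters.
            self.digest = (self.digest + " " + oldest[:40]).strip()


def build_prompt(state: dict, transcript: Transcript) -> str:
    """Assemble one turn's context. State is injected fresh and never compacted."""
    return "\n".join([
        "## Authoritative state (injected each turn)",
        json.dumps(state, sort_keys=True),
        "## Digest of older turns (lossy)",
        transcript.digest,
        "## Recent turns (verbatim)",
        *transcript.recent,
    ])
```

Because `state` is passed in separately on every call, nothing that happens to `digest` can alter it.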

Matt Pocock

@mattpocockuk

28 days ago

A context engineering metaphor I've been playing around with:

- Primary source: the source of truth. Raw data. Transcripts. Code.
- Secondary source: one step removed. Summaries. Compactions. Documentation.

For instance, compaction takes a primary source (the conversation history) and turns it into a secondary source (the summary). This is lossy, but means the secondary source can fit into a smaller space.

If you want to know what your codebase does, your code is a primary source. Your docs are a secondary source.

Loading primary sources into context is expensive, but provides richer context. Secondary sources are cheaper to load into context, but may be information-lossy.

Any context engineering will involve managing the tradeoffs between both.

Alex Gates-Shannon (they/them)

@alexgsdev

28 days ago

@brijeshcaet @JustJerry121 Thanks, I'll check it out!

Alex Gates-Shannon (they/them)

@alexgsdev

28 days ago

@ViceSol Do you have links to those lists? That'd be super helpful.

Alex Gates-Shannon (they/them)

@alexgsdev

29 days ago

The split I'd add before those four: some memory must be exact, some is fuzzy recall. Run the exact slice through a ranker and you corrupt your ground truth — not because the ranker failed, but because relevance is the wrong question for a fact with one right answer. RAG for the fuzzy half; exact injection for the rest.
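
A small Python sketch of that split, under stated assumptions: the `EXACT` facts, the `NOTES` list, and the word-overlap scoring are hypothetical stand-ins (a real system would likely use embeddings for the fuzzy half). The point is only that the exact path is a keyed lookup with no ranking step.

```python
# Hypothetical exact facts: each key has one right answer.
EXACT = {"hp": 12, "keycard": "Bay 4"}

# Hypothetical fuzzy memory: free-text notes where "most relevant" is a fair question.
NOTES = [
    "the crew mess smelled of burnt coffee",
    "a stranger was seen near the cargo lift",
    "the captain distrusts the engineer",
]


def recall_exact(key: str):
    """Keyed lookup. A missing key raises KeyError; it never returns a near match."""
    return EXACT[key]


def recall_fuzzy(query: str, k: int = 2) -> list[str]:
    """Rank notes by the number of words shared with the query and return the top k."""
    query_words = set(query.lower().split())
    return sorted(
        NOTES,
        key=lambda note: len(query_words & set(note.split())),
        reverse=True,
    )[:k]
```

Asking `recall_exact` for an unknown key fails loudly, whereas `recall_fuzzy` always returns its best guess.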

Alex Gates-Shannon (they/them)

@alexgsdev

29 days ago

@trq212 The reliability doesn't come from the subagents; it comes from the deterministic harness holding state outside any one context window. The sorting example shows it: the loop holds the bracket, and only the running order stays in context. The same applies to a single agent: drift stops when state lives outside the window.

Alex Gates-Shannon (they/them)

@alexgsdev

29 days ago

@mugabuilds I aggressively prune the history and keep state in a separate backend database. State gets injected from the database on each turn, so there's nothing for the model to track.

Alex Gates-Shannon (they/them)

@alexgsdev

29 days ago

@ryanx_ai @JustJerry121 This tracks with one refinement: anything reconstructable from message history will get reconstructed. I prune the history and inject authoritative state each turn, rather than trusting curated context as canonical.

Alex Gates-Shannon (they/them)

@alexgsdev

29 days ago

@EdwinRubioIam Reminds me of this: https://t.co/KlLnHTCXkf

Shreya Shankar

@sh_reya

about 1 month ago

Some thoughts after building the Data Agent Benchmark

- build a "semantic layer" for all your data. a simple first pass can be done by running an LLM over a sample of data from each table, to come up with column annotations (commonly known as semantic types), possible functional dependencies (i.e., columns that depend on each other)
- use the semantic layer in the prompt for all questions
- enterprise data often needs to be cleaned. rather than try to clean all data up front (which is really difficult), keep a memory of subsets of data that need to be cleaned and how, so that the relevant data can be cleaned by the agent at query time
- extend the harness (i.e., codex, claude code) with a tool like DocETL or Claude workflows; basically the ability to run agentic map-reduce. often questions require reasoning about unstructured text columns to come up with the answer, which SQL or code doesn't support

Alex Gates-Shannon (they/them)

@alexgsdev

29 days ago

Just caught myself asking Claude Code about a data structure in my own codebase instead of digging it up, even though digging would've been faster. At some point, I stopped optimizing for wall time and started optimizing for activation energy.

Alex Gates-Shannon (they/them)

@alexgsdev

30 days ago

@eugeneyan This generalizes past security: the moment an agent's output becomes state, it's a claim and needs to be validated against ground truth the model can't fabricate. Sometimes that's cheap. Reproducing an exploit in a sandbox is hard — hence the bottleneck.

Alex Gates-Shannon (they/them)

@alexgsdev

about 1 month ago

@JongWerk It depends on what the agent is doing. If you're storing state outside the context window, then you can compact & summarize aggressively.

Alex Gates-Shannon (they/them)

@alexgsdev

about 1 month ago

@LeoTava8 Samesies! https://t.co/6UyOO4Zo0X

Alex Gates-Shannon (they/them)

@alexgsdev

about 1 month ago

@charlespacker I built an agentic app entirely on this premise — state in my own backend with the model behind a tool boundary. Swapping models is just a config change. The ownership argument shows up even at single-app scale, not just platform scale.

Alex Gates-Shannon (they/them)

@alexgsdev

about 1 month ago

@yoheinakajima Thanks! Zoltar's a TTRPG game master built on Claude with a hard tool-call boundary. The model proposes actions, the backend validates and owns all state (dice, inventory, canon). Write-up of the core idea here: https://t.co/6UyOO4Zo0X. Repo: https://t.co/2rorDCk6Ov.
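
A rough Python sketch of what a tool-call boundary of this kind might look like. The `Backend` class, the tool names `roll_dice` and `use_item`, and the call format are hypothetical and are not taken from the Zoltar repo; the sketch only illustrates "the model proposes, the backend validates and owns state".

```python
import random
from dataclasses import dataclass, field

ALLOWED_DICE = (4, 6, 8, 10, 12, 20)  # hypothetical set of supported dice


@dataclass
class Backend:
    inventory: dict[str, int] = field(default_factory=dict)
    rng: random.Random = field(default_factory=random.Random)

    def handle(self, call: dict) -> dict:
        """Validate a proposed tool call and apply it only if it is legal."""
        name, args = call.get("name"), call.get("args", {})
        if not isinstance(args, dict):
            return {"ok": False, "error": "args must be an object"}
        if name == "roll_dice":
            sides = args.get("sides")
            if not isinstance(sides, int) or sides not in ALLOWED_DICE:
                return {"ok": False, "error": "unsupported die"}
            # The backend rolls. Any result the model put in args is ignored.
            return {"ok": True, "result": self.rng.randint(1, sides)}
        if name == "use_item":
            item = args.get("item")
            if not isinstance(item, str) or self.inventory.get(item, 0) < 1:
                return {"ok": False, "error": "item not in inventory"}
            self.inventory[item] -= 1
            return {"ok": True, "remaining": self.inventory[item]}
        return {"ok": False, "error": f"unknown tool {name!r}"}
```

The model only ever receives the returned dictionaries, so a rejected proposal never touches `inventory`.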

Alex Gates-Shannon (they/them)

@alexgsdev

about 1 month ago

Zoltar uses something close to this. An append-only event log of validated tool calls is the canonical state, and the model only sees a rolling window plus lazy summaries. The model can't fabricate state because it never owns state. Glad to see this formalized.
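
A minimal Python sketch of an append-only log of validated events, with state derived by replay and a rolling window for the model. The `Event` fields, the two tool names, and the `EventLog` API are assumptions for illustration; lazy summaries are left out, and this is not the design from the paper or the Zoltar repo.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    tool: str   # hypothetical tools: "add_item" or "remove_item"
    item: str
    qty: int


@dataclass
class EventLog:
    _events: list[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        """Validate first, then append. Existing entries are never edited or removed."""
        if event.tool not in ("add_item", "remove_item") or event.qty <= 0:
            raise ValueError("invalid event")
        if event.tool == "remove_item" and self.replay().get(event.item, 0) < event.qty:
            raise ValueError("cannot remove more than is held")
        self._events.append(event)

    def replay(self) -> dict[str, int]:
        """Derive the canonical state by folding over the whole log."""
        state: dict[str, int] = {}
        for e in self._events:
            delta = e.qty if e.tool == "add_item" else -e.qty
            state[e.item] = state.get(e.item, 0) + delta
        return state

    def window(self, n: int) -> list[Event]:
        """Return the last n events, which is all the model would be shown."""
        return self._events[-n:] if n > 0 else []
```

State is always recomputed from the log, so there is no separate copy for the model or a summary to drift away from.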

Yohei

@yoheinakajima

about 1 month ago

babyagi has ~200 citations, but 0 papers... i just published my first paper on arXiv 😆 "The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems" https://t.co/c7mbRggdCh the case for agents that coordinate through persistent replayable state — no conversation loops, no workflows, no A2A — with auditability, forking, and causal lineage built in. check it out and let me know what you think!
