@karpathy's AutoResearch made one thing visible:
the frontier question is no longer whether a model can answer once.
It is whether it can survive the loop.
That is why we built AutoLab.
161 evals | 23 tasks | 7 frontier models | 8,891 trajectories | 633M tokens
If you want to watch agents struggle, double down, pivot, and occasionally break through, come watch the Live Lab:
https://t.co/v4HRAc8ouz
meta: my chat with Claude got too long while drafting this critique of the RLM paper. Claude couldn't fit the full conversation in context. so it grepped the local transcript file and pulled in relevant sections. context as external variable, examined and retrieved programmatically... wait, my Claude is already doing RLM?
on the paper itself (@a1zhang, @lateinteraction): the core problem is real. models need clean separation between the context they're reasoning over and the intermediate results of exploring that context. tool outputs and sub-call results shouldn't pollute the window you're thinking in. context rot from accumulated junk is a genuine failure mode.
but this divide-and-conquer is already happening at the harness level and useful patterns are being RLed into models. plan mode → external checklist → Ralph Wiggum loops working through tasks one at a time with fresh context. subagents returning distilled results so junk never hits the parent window. context-driven file exploration (check length, grep structure, selectively read)...
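a minimal sketch of that last pattern (check length, grep structure, selectively read), assuming plain text. the function names, the 2,000-character threshold, and the one-line window are all made up for illustration; this is not any particular harness's tool.

```python
import re
from pathlib import Path


def select_relevant(text: str, pattern: str, max_chars: int = 2000, window: int = 1) -> str:
    """Return the text whole if it is small, else only lines near regex matches."""
    # step 1: check length before pulling anything into context
    if len(text) <= max_chars:
        return text
    lines = text.splitlines()
    # step 2: grep for structure (line indices that match the pattern)
    hits = [i for i, line in enumerate(lines) if re.search(pattern, line)]
    # step 3: selectively read a small window around each hit, deduplicated
    keep = sorted(
        {
            j
            for i in hits
            for j in range(max(0, i - window), min(len(lines), i + window + 1))
        }
    )
    # prefix 1-based line numbers so a follow-up read can target them
    return "\n".join(f"{j + 1}: {lines[j]}" for j in keep)


def explore_file(path: str, pattern: str) -> str:
    """Hypothetical file wrapper around select_relevant."""
    return select_relevant(Path(path).read_text(encoding="utf-8"), pattern)
```

the point of the shape: the parent window only ever sees the matched slices, never the whole file.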
do the above well and each sub-task gets a focused window with mostly relevant context. this is where RLM's recursive approach actually costs you — every sub-call is a fresh prefill with no KV cache sharing, plus scaffolding overhead. when context is mostly relevant and fits in window, a warm cache with full cross-context attention wins outright.
the training contribution is clean RL env design: the model can't read long snippets from the prompt, forcing it to learn selective exploration and recursive decomposition. but existing coding tools already impose the same constraint — Claude Code's read tool rejects files over ~25k tokens. models are already learning context decomposition because their harness tooling forces it when being RLed.
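a rough sketch of what a size-capped read tool looks like, to make the constraint concrete. this is not Claude Code's actual implementation: the 4-characters-per-token estimate, the error type, and the offset/limit parameters are assumptions for illustration. only the ~25k figure comes from the point above.

```python
from typing import Optional


class FileTooLargeError(Exception):
    """Raised when the requested slice exceeds the token budget (hypothetical)."""


def estimate_tokens(text: str) -> int:
    # crude heuristic, assumed here: roughly 4 characters per token
    return len(text) // 4


def read_tool(text: str, max_tokens: int = 25_000, offset: int = 0,
              limit: Optional[int] = None) -> str:
    """Return lines [offset, offset + limit) of text, refusing oversized reads."""
    lines = text.splitlines()
    end = None if limit is None else offset + limit
    body = "\n".join(lines[offset:end])
    if estimate_tokens(body) > max_tokens:
        # the refusal is the training signal: the caller has to narrow the read
        raise FileTooLargeError(
            f"about {estimate_tokens(body)} tokens requested, cap is {max_tokens}; "
            "pass offset/limit or search first"
        )
    return body
```

a model calling this on a big file gets an error instead of a flooded window, so the only way forward is a narrower read (offset/limit) or a search first. same pressure as the RL env, just coming from the tool.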
for frontier models, the path forward is better divide-and-conquer, better tool use for external context (transcripts, persisted state files, disk artifacts), and better RL for learning when to decompose. not a new paradigm. all of these are already underway, and some RLM patterns are already there, as the opening makes clear.
We are announcing the Lighter Infrastructure Token (LIT)! Lighter is building infrastructure for the future of finance and the native token is key to aligning incentives. In this thread, we will describe the structure of the token, broader vision, and roadmap of use cases.
We’ve raised 17 million led by @PanteraCapital, with participation from @Sequoia and others.
Fin enables users and businesses to move millions of dollars instantly - whether to other Fin users, directly into bank accounts, or across crypto rails.
If banks and payment products could be rebuilt from the ground up today, they would look like Fin.
we first met w sequoia in May '22. they invested in Dec '23. @shaunmmaguire intro'd us to spacex (and pushed us hard to think bigger), @josephinekchen connected us to some of our first customers, @Alfred_Lin helped us reason through scaling problems ...
If you can meet w sequoia, do it. If you can partner with them, definitely do it. it might take some time.
they very much helped shape bridge