I built an open-source RL environment that trains LLMs to get better at knowledge updating, one of the open problems in long-horizon memory. The idea is simple: when a fact changes, the model has to start answering with the new value and stop using the old one.
This is a known, well-documented failure. Long-term memory has been a big focus at @OpenAI, @AnthropicAI, @GoogleDeepMind, and almost every other frontier lab this year, and remembering things across sessions has gotten genuinely good. Updating them is where it still breaks: you tell it something, that something changes later, and it keeps giving you the old answer.
The best fixes today live inside the closed frontier systems. The open side only measures the problem, with benchmarks like LongMemEval, MemoryArena, and FAMA; there's no open environment to actually train against. So I built one.
Here's what I found:
- Give the top models the full conversation and they answer these update questions right 92% of the time. Force them onto the bounded memory they actually use in production and it drops to 77%, nearly 1 in 4 wrong.
- The longer the conversation, the worse it gets, past 70% wrong on the longest ones.
- You can't scale out of it: a bigger, smarter model didn't help, and 24x more memory recovered almost none of the loss. It tracks conversation length, not memory size.
Here's the part I care about: you can train it. I took a small open model (Qwen2.5-3B), trained it in the environment with GRPO on a verifiable reward (RLVR), and its accuracy on real conversations it had never seen nearly doubled, from 9% to 16.7%. It only ever practiced on synthetic examples and still improved on real ones, so it learned the actual skill rather than memorizing answers.
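To make "verifiable reward" concrete, here is a minimal sketch of what a knowledge-update reward could look like. It is not the environment's actual reward, which the thread doesn't spell out; the function names and the string-matching rule are assumptions for illustration only.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and replace runs of punctuation/whitespace with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def _mentions(answer: str, value: str) -> bool:
    """True if `value` appears in `answer` as a whole-token phrase."""
    return f" {_normalize(value)} " in f" {_normalize(answer)} "


def update_reward(answer: str, new_value: str, old_value: str) -> float:
    """Hypothetical reward: 1.0 only if the answer gives the current value
    and does not repeat the stale one; 0.0 otherwise."""
    has_new = _mentions(answer, new_value)
    has_old = _mentions(answer, old_value)
    return 1.0 if has_new and not has_old else 0.0


print(update_reward("You live in Berlin now.", "Berlin", "Munich"))
print(update_reward("You live in Munich.", "Berlin", "Munich"))
```

Under this rule an answer that hedges between both values scores zero, which is one way to push a model to drop the old fact rather than just add the new one. A real reward would likely need a more forgiving matcher.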
As far as I know this is the first environment that rewards a model for keeping a fact current instead of just recalling it. It's not solved (16.7% is still low), but it's the first sign this gap can be trained. It's a small, measurable corner of what @_sholtodouglas has flagged as one of the big open problems, continual learning: keeping what a model knows current as the world changes, instead of freezing it at training time.
Try it yourself: https://t.co/5ynbfsX0xT
Built on @willccbb's verifiers/prime-rl, evaluation conversations adapted from LongMemEval, live on @PrimeIntellect's Environments Hub, model and dataset on @huggingface.
For the retriever to fetch insightful context without bloating the context window, which is crucial for answering complex, nuanced questions, smart structuring of the data is extremely important.
Structure enables multi-hop thinking. That’s what https://t.co/Q5HJzazq7L provides to AI agents.
Watch this video for an example.
@tom_doerr great that it highlights retrieval-time reasoning! Been highlighting its importance and optimizing for that at https://t.co/s1bn0a7mt5
Check it out if you want to set up a much more advanced second brain in just a few clicks.
GBrain nails the ingestion and hybrid retrieval layer solving scaling, context window, and multi-hop reasoning limitations. What we've been finding building https://t.co/Q5HJzazq7L is that hybrid search is necessary but not sufficient for complex questions.
How you structure the knowledge before retrieval determines the quality of context your AI agent gets. That's the layer we're focused on.
Exactly this: "the brain needs to be great at self-organizing in a thoughtful schema". This is the bottleneck nobody's talking about.
How we structure and organize our knowledge is what shapes our thinking. And it's what will determine what your AI can and cannot reason over.
Most systems today structure knowledge as vectors. Vectors capture similarity. They can't encode contradictions, track how relationships change over time, or connect insights across documents.
No amount of inference-time reasoning fixes poorly structured context.
We're building this layer at https://t.co/Q5HJzazq7L. Knowledge graph that self-organizes, consolidates over time, and reasons across the structure at query time.
.@karpathy's wiki tweet resonated for a reason. Everyone wants AI to understand their knowledge base, not just search through it naively. @garrytan took it further with GBrain, solving the basic filesystem’s scaling, context window, and multi-hop reasoning limitations using hybrid vector+keyword search.
But even the best hybrid RAG systems hit a wall on complex questions:
- They retrieve what's semantically similar, not what's actually relevant to your question's intent
- They can't connect a finding in one document to a contradiction in another
- They treat every query the same whether it needs one fact or a chain of five
- They have no memory of what worked before. Every query starts from zero
Similarity ≠ Relevance. An expert doesn't find answers by scanning for matching keywords. They navigate a structured mental model where every concept is connected to every other concept they've ever encountered.
That's the layer we built. Vrin structures your documents into a knowledge graph, then runs a continuous internal dialogue over it, questioning new facts against existing ones, detecting contradictions, deduplicating across documents, and strengthening connections that get validated by usage, the same way your brain consolidates memories during rest.
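As a rough illustration of the consolidation steps described above (deduplicating across documents, flagging contradictions, strengthening validated connections), here is a toy sketch. It is not Vrin's implementation: the `Fact` and `KnowledgeGraph` names, the triple format, and the rule that a relation holds one value per subject are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    sources: set = field(default_factory=set)
    weight: float = 1.0  # raised when the fact is validated by usage


class KnowledgeGraph:
    """Toy triple store. Assumes each (subject, relation) has one true value."""

    def __init__(self):
        self.facts = {}           # (subject, relation, obj) -> Fact
        self.contradictions = []  # pairs of conflicting fact keys

    def add(self, subject, relation, obj, source):
        key = (subject, relation, obj)
        if key in self.facts:
            # Deduplicate: same triple from another document, keep one fact.
            self.facts[key].sources.add(source)
            return key
        # Question the new fact against existing ones.
        for other in self.facts:
            if other[0] == subject and other[1] == relation:
                self.contradictions.append((other, key))
        self.facts[key] = Fact(subject, relation, obj, {source})
        return key

    def reinforce(self, key, amount=0.5):
        """Strengthen a fact that was useful in answering a query."""
        self.facts[key].weight += amount


g = KnowledgeGraph()
a = g.add("model-x", "context_length", "8k", "paper1.pdf")
g.add("model-x", "context_length", "8k", "paper2.pdf")
b = g.add("model-x", "context_length", "32k", "paper3.pdf")
print(len(g.facts), g.contradictions)
```

A plain vector index would store all three statements as near neighbours; the explicit triple structure is what lets the conflict between the second and third value be recorded at all.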
At query time, Vrin consults the graph's structure before it even decomposes your question, so it searches for entities that actually exist in your knowledge base, not generic keywords from your query. Then it iteratively reasons through sub-questions, adapting what it's looking for at each step based on what it finds.
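A simplified sketch of the idea of grounding a question in entities that exist in the graph and then hopping between them. The adjacency-dict graph, the substring matcher, and the breadth-first search are stand-ins chosen for illustration; the actual decomposition and iterative reasoning are not described in enough detail here to reproduce.

```python
from collections import deque


def ground_entities(question, graph):
    """Return graph entities mentioned in the question.
    Naive case-insensitive substring match, for illustration only."""
    q = question.lower()
    return [entity for entity in graph if entity.lower() in q]


def find_path(graph, start, goal, max_hops=5):
    """Breadth-first search over an adjacency dict.
    Returns the list of entities from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) - 1 >= max_hops:
            continue  # hop budget spent on this branch
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical graph: entity -> entities it links to.
graph = {
    "GRPO": ["RLVR"],
    "RLVR": ["verifiable reward"],
    "verifiable reward": [],
    "LongMemEval": [],
}
entities = ground_entities("How does GRPO relate to a verifiable reward?", graph)
print(entities)
print(find_path(graph, entities[0], entities[1]))
```

The point of grounding first is that the search starts from nodes known to exist, so the intermediate hop ("RLVR" here) is found by following structure rather than by hoping the query text happens to resemble it.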
Same per-query cost as standard RAG. Dramatically better context.
We ran 20 AI research papers through both systems. Same question, side by side. Vrin surfaced cross-paper insights that standard RAG couldn't reach.
Beta testing now. https://t.co/Q5HJzazq7L
what is it about human cognition that synthesizes relatively unique insights in every individual from the exact same information?
how can we replicate that in AI?
because not everyone realizes there's a problem in the first place, so why would they care about it? most wouldn't think vector-based RAG is a problem, since it has been working decently. that only changes once innovative approaches that understand the nuances in our queries are implemented and become mainstream
LLM Knowledge Bases
Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:
Data ingest:
I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them.
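One piece of the "compile" step that is easy to sketch is the backlink index. The snippet below assumes the wiki pages use Obsidian-style `[[wikilinks]]` and are already loaded into a dict of title to markdown body; both are assumptions, and `build_backlinks` is a hypothetical helper, not part of the setup described.

```python
import re
from collections import defaultdict

# Captures the target of [[Target]], [[Target|alias]] and [[Target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def build_backlinks(pages):
    """pages: title -> markdown body.
    Returns title -> sorted list of pages that link to it."""
    backlinks = defaultdict(set)
    for title, body in pages.items():
        for target in WIKILINK.findall(body):
            target = target.strip()
            if target != title:  # ignore self-links
                backlinks[target].add(title)
    return {target: sorted(sources) for target, sources in backlinks.items()}


pages = {
    "GRPO": "See [[RLVR]] and [[Reward design|rewards]].",
    "RLVR": "Used with [[GRPO]].",
    "Reward design": "Related: [[RLVR#history]].",
}
print(build_backlinks(pages))
```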
IDE:
I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki, I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).
Q&A:
Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents and it reads all the important related data fairly easily at this ~small scale.
Output:
Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.
Linting:
I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searches), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.
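The health checks described here are LLM-driven, but a cheap deterministic pass can sit alongside them. A minimal sketch, assuming Obsidian-style `[[wikilinks]]` and a hypothetical `lint_wiki` helper: links that point at pages that don't exist yet are natural new-article candidates, and pages nothing links to may need connecting.

```python
import re
from collections import Counter

# Captures the target of [[Target]], [[Target|alias]] and [[Target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def lint_wiki(pages):
    """pages: title -> markdown body.
    Returns link targets with no page (with mention counts) and
    pages that no page links to."""
    missing = Counter()
    linked = set()
    for body in pages.values():
        for target in WIKILINK.findall(body):
            target = target.strip()
            linked.add(target)
            if target not in pages:
                missing[target] += 1
    orphans = sorted(title for title in pages if title not in linked)
    return {"missing": dict(missing), "orphans": orphans}


pages = {
    "Index": "[[GRPO]] [[RLVR]]",
    "GRPO": "uses [[RLVR]]",
    "Notes": "misc",
}
print(lint_wiki(pages))
```

The report could then be handed to the LLM as a starting list of things to write or fix.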
Extra tools:
I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I both use directly (in a web ui), but more often I want to hand it off to an LLM via CLI as a tool for larger queries.
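For a sense of how small a "naive search engine over the wiki" can be, here is a sketch using term frequency weighted by a simple inverse document frequency. This is not the tool described above; the scoring formula, the `search` function, and the in-memory `pages` dict are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(pages, query, top_k=3):
    """pages: title -> markdown body. Returns the best-matching titles."""
    docs = {title: Counter(tokenize(body)) for title, body in pages.items()}
    n = len(docs)
    scores = {}
    for term in set(tokenize(query)):
        df = sum(1 for counts in docs.values() if term in counts)
        if df == 0:
            continue  # term appears nowhere in the wiki
        idf = math.log(1 + n / df)  # rarer terms count for more
        for title, counts in docs.items():
            if term in counts:
                scores[title] = scores.get(title, 0.0) + counts[term] * idf
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_k]]


pages = {
    "GRPO": "GRPO is a policy gradient method. GRPO compares groups of samples.",
    "Obsidian": "Obsidian renders markdown notes.",
    "Marp": "Marp turns markdown into slides.",
}
print(search(pages, "markdown slides"))
```

Wrapping `search` in a small command-line entry point would be enough to expose it to an LLM agent as a tool.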
Further explorations:
As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.
TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.
Furthermore, vector/keyword similarity retrieval is a dead end when dealing with complex questions as it cannot understand the nuance of what you’re asking. That’s why we need retrieval-time reasoning.
And that’s exactly why we’ve built @vrindotcloud to enable multi-hop reasoning within AI Agents.
@karpathy have you verified how often this setup helped the LLM tackle true multi-hop queries? since the transformer architecture itself can only do a limited # of hops
also, during longer tasks, any regression due to the “Lost in the Middle” phenomenon?