I built an open-source RL environment that trains LLMs to get better at knowledge updating, one of the open problems in long-horizon memory. The idea is simple: when a fact changes, the model has to start answering with the new value and stop using the old one.
This is a known, well-documented failure. Long-term memory has been a big focus at @OpenAI, @AnthropicAI, @GoogleDeepMind, and almost every other frontier lab this year, and remembering things across sessions has gotten genuinely good. Updating them is where it still breaks: you tell the model something, that something changes later, and it keeps giving you the old answer.
The best fixes today live inside the closed, frontier systems. The open side only measures the problem, with benchmarks like LongMemEval, MemoryArena, and FAMA; there's no open environment to actually train against. So I built one.
Here's what I found:
- Give the top models the full conversation and they answer these update questions right 92% of the time. Force them onto the bounded memory they actually use in production and it drops to 77%, nearly 1 in 4 wrong.
- The longer the conversation, the worse it gets: past 70% wrong on the longest ones.
- You can't scale out of it: a bigger, smarter model didn't help, and 24x more memory recovered almost none of the loss. It tracks conversation length, not memory size.
Here's the part I care about: you can train it. I took a small open model (Qwen2.5-3B), trained it in the environment with GRPO on a verifiable reward (RLVR), and its accuracy on real conversations it had never seen nearly doubled, from 9% to 16.7%. It only ever practiced on synthetic examples and still improved on real ones, so it learned the actual skill rather than memorizing answers.
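To make "verifiable reward" concrete, here is a minimal sketch of the kind of check described above: the answer has to contain the new value and must not contain the old one. This is an illustration under assumptions, not the environment's actual reward code. The names `update_reward` and `group_advantages` are hypothetical, the string matching is a deliberately crude stand-in for whatever grading the environment really uses, and the second function only shows the usual GRPO idea of normalizing rewards within a group of sampled answers to the same prompt.

```python
import re
from statistics import mean, pstdev


def _tokens(text: str) -> str:
    """Lowercase, drop punctuation, and pad with spaces for whole-token matching."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return f" {' '.join(cleaned.split())} "


def update_reward(answer: str, new_value: str, old_value: str) -> float:
    """Hypothetical reward: 1.0 only if the answer uses the new value and not the old one.

    Assumption: an answer that mentions the stale value at all is scored 0.0,
    even if it also mentions the new one. A real grader may be more lenient.
    """
    padded = _tokens(answer)
    has_new = _tokens(new_value) in padded
    has_old = _tokens(old_value) in padded
    return 1.0 if has_new and not has_old else 0.0


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each reward minus the group mean, over the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same question, after "Paris" was updated to "Berlin".
samples = [
    "You live in Berlin now.",
    "You live in Paris.",
    "You moved from Paris to Berlin.",
    "You're in Berlin.",
]
rewards = [update_reward(s, "Berlin", "Paris") for s in samples]
advantages = group_advantages(rewards)
```

Because the reward is a plain function of the answer and the two values, it can be checked automatically on every rollout, which is what makes it usable for RLVR.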
As far as I know this is the first environment that rewards a model for keeping a fact current instead of just recalling it. It's not solved (16.7% is still low), but it's the first sign this gap can be trained. It's a small, measurable corner of what @_sholtodouglas has flagged as one of the big open problems, continual learning: keeping what a model knows current as the world changes, instead of freezing it at training time.
Try it yourself: https://t.co/5ynbfsX0xT
Built on @willccbb's verifiers/prime-rl, evaluation conversations adapted from LongMemEval, live on @PrimeIntellect's Environments Hub, model and dataset on @huggingface.
.@karpathy's wiki tweet resonated for a reason. Everyone wants AI to understand their knowledge base, not just search through it naively. @garrytan took it further with GBrain, solving the basic filesystem’s scaling, context window, and multi-hop reasoning limitations using hybrid vector+keyword search.
But even the best hybrid RAG systems hit a wall on complex questions:
- They retrieve what's semantically similar, not what's actually relevant to your question's intent
- They can't connect a finding in one document to a contradiction in another
- They treat every query the same whether it needs one fact or a chain of five
- They have no memory of what worked before. Every query starts from zero
Similarity ≠ Relevance. An expert doesn't find answers by scanning for matching keywords. They navigate a structured mental model where every concept is connected to every other concept they've ever encountered.
That's the layer we built. Vrin structures your documents into a knowledge graph, then runs a continuous internal dialogue over it: questioning new facts against existing ones, detecting contradictions, deduplicating across documents, and strengthening connections that get validated by usage. It's the same way your brain consolidates memories during rest.
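For a feel of what those graph operations mean, here is a toy sketch. It is not Vrin's implementation: `Fact`, `ToyGraph`, `add`, and `reinforce` are hypothetical names, facts are simplified to (subject, predicate, object) triples, and every predicate is assumed to hold a single value, so two different objects for the same subject and predicate count as a contradiction.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    sources: set = field(default_factory=set)  # documents that state this fact
    weight: float = 1.0                        # grows when the fact proves useful


class ToyGraph:
    """Hypothetical, minimal fact store keyed by (subject, predicate, object)."""

    def __init__(self):
        self.facts = {}
        self.contradictions = []  # pairs of keys that disagree

    def add(self, subject, predicate, obj, source):
        key = (subject, predicate, obj)
        if key in self.facts:
            # Deduplicate: same fact from another document, just record the source.
            self.facts[key].sources.add(source)
            return key
        # Question the new fact against existing ones (single-valued assumption).
        for other in self.facts:
            if other[0] == subject and other[1] == predicate:
                self.contradictions.append((other, key))
        self.facts[key] = Fact(subject, predicate, obj, {source})
        return key

    def reinforce(self, key, amount=0.1):
        """Strengthen a fact that was validated by usage."""
        self.facts[key].weight += amount


graph = ToyGraph()
k1 = graph.add("Method X", "best_accuracy", "91%", "paper_1.pdf")
graph.add("Method X", "best_accuracy", "91%", "paper_2.pdf")       # duplicate
k2 = graph.add("Method X", "best_accuracy", "84%", "paper_3.pdf")  # conflict
graph.reinforce(k1)
```

After these calls the graph holds two facts, one of them backed by two sources, plus one recorded contradiction between `k1` and `k2`.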
At query time, Vrin consults the graph's structure before it even decomposes your question, so it searches for entities that actually exist in your knowledge base, not generic keywords from your query. Then it iteratively reasons through sub-questions, adapting what it's looking for at each step based on what it finds.
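A rough sketch of that query-time flow, under heavy assumptions: `ground_entities` and `follow_chain` are hypothetical helpers, entity matching is naive substring search, and the graph is reduced to a dictionary of (entity, relation) edges. It only illustrates the two ideas in the paragraph above, grounding the question in entities that exist in the graph and letting each step depend on what the previous step found.

```python
def ground_entities(question, known_entities):
    """Return the graph entities that actually appear in the question."""
    q = question.lower()
    return [e for e in sorted(known_entities) if e.lower() in q]


def follow_chain(edges, start, relations):
    """Walk the graph one sub-question at a time.

    `edges` maps (entity, relation) to the next entity. Each hop looks up
    the entity found by the previous hop, and the walk stops early when a
    hop finds nothing.
    """
    path = [start]
    current = start
    for relation in relations:
        nxt = edges.get((current, relation))
        if nxt is None:
            break
        path.append(nxt)
        current = nxt
    return path


# Hypothetical knowledge base built from a few papers.
edges = {
    ("Paper A", "proposes"): "Method X",
    ("Method X", "evaluated_on"): "Dataset Y",
}
known = {"Paper A", "Method X", "Dataset Y"}

question = "Which dataset was the method from Paper A evaluated on?"
starts = ground_entities(question, known)
path = follow_chain(edges, starts[0], ["proposes", "evaluated_on"])
```

Here the question grounds to "Paper A" only, and the answer "Dataset Y" is reached in two hops through "Method X", which never appears in the question text.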
Same per-query cost as standard RAG. Dramatically better context.
We ran 20 AI research papers through both systems. Same question, side by side. Vrin surfaced cross-paper insights that standard RAG couldn't reach.
Beta testing now. https://t.co/Q5HJzazq7L