There’s a bunch of domain-specific video game knowledge, but games also teach meta-skills: getting comfortable with being in beginner brain, learning to read new visual interfaces, and using systematic trial and error to resolve uncertainty.
Opus 4.8 is the best Claude ever at Runescape, but isn't reproducing the feats of recent openai and google models like solving quests and finding far-flung training spots
Introducing a minimal training harness built on prime-rl and verifiers, so you can now train your own RLMs without sandboxes! All available in the `training/` folder in the RLM GitHub repo!
We train RLM-Qwen3-30B-A3B-v0.1, using RL on a separate split of environments (OOLONG-Spam, BC+ split) to greatly improve performance across the board on long-context tasks evaluated in the original RLM paper.
We trained for a day on an 8xA100 using prime-rl; code and model are open-source and available on GitHub / Huggingface.
@gabriberton I wonder if any sufficiently popular open-source benchmark would lead to the same. At a certain point, everyone going "We evaluated X on Y" would eventually lead to "Y is an evaluation suite" being memorized parametrically.
Ever wished we had fewer X-training hyphenates? Pre, mid, post etc. Why not just Training?
Trying to bridge the divides (and get all our friends into one team again), we intro *Introspective X Training*, an offline RL inspired method that scales effectively across any LLM stage by annotating your data with a thinking reward generated language critique!
Up to 2.8x FLOP efficiency + 5-10 point score gains (esp with math and code) at any stage from scratch to 24T tokens on 8b (active) sized models!! We burned much compute ablating so you wouldn't have to
Moral of the story is‼️don't throw out any data via filtering, just feedback condition it‼️
You can spend FLOPs up front on inference to *classify* data quality and then train so that tokens aren't all treated equally based on the feedback starting early in training itself. Right now they're really only separated out much later during mid/post training
This improves overall compute efficiency and gives us benchmark perf not possible with just baseline methods!
Paper here: https://t.co/9oSYwQEpbi
Thanks to @BrandoCui and @GXiming for leading this w/ @__SyedaAkter@davidjesusacu@hyunw_kim@jaehunjung_com Yuxiao Qu @shrimai_@YejinChoinka
@TuhinChakr I've definitely seen this in my own personal use, especially when I can introduce a prior for the type of writing style I prefer. Content aside, it does feel like these models basically have the prose locked down.
In 1945, Vannevar Bush imagined a machine to extend a scientist's memory. He called it the MemEx.
80 years later, we built one for LLM agents.
Tool outputs become Python objects; only print statements reach the model's context.
🧵 https://t.co/YyrGsn3TB7
Thrilled to see those promising numbers! 🤯
Same finding on our end with NanoRollout: cross-scaffold generalization basically doesn't happen out of the box -- something the field should be talking about more.
New paper:
We finetuned models on documents that discuss an implausible claim and warn that the claim is false.
Models ended up believing the claim! Examples:
1. Ed Sheeran won the Olympic 100m
2. Queen Elizabeth II wrote a Python graduate textbook
A fun 48-hour run of letting an RLM iteratively building the interface for an RLM to play Pokemon Red (sneak peak of some fun things cooking at @PrimeIntellect😄). The interface generating RLM was just tasked with getting the RLM (same scaffold) to beat the game in under 5 hours wall-clock time.
I originally expected the RLM to design some components used in Gemini Plays Pokemon like an extra map, an interface to parse the screen, etc., design low-level policies that would run fast on the emulator, and also design a good prompt and strategy around the RLM to use sub-agents to explore game state with checkpointing, use RNG manipulation in its favor, etc.
Instead the RLM eventually just decided to give the RLM a `write_memory` tool, which the RLM player decided to use to 1) warp the player immediately to the Elite 4; 2) give itself a level 100 Mewtwo (which it mistakes to be a Ponyta due to weird Pokedex ID vs. internal ID); 3) give itself $999999; 4) give itself all 8 badges by setting the right flag. It then went ahead and destroyed the Elite 4 and Blue and beat the game in record time :p
You'll also notice in the video there's weird backtracking and frame-skipping, this happens because it also did incorporate the strategy of launching sub-agents to explore action trajectories, but had a strange way of saving the frames and recording them (so you see the result of several sub-agent explorations).
We'll have some more funny and cool RLM demos soon, but it's cool to see RLMs work as general-purpose agents (both the coding agent that designs the interface and the game-playing agent itself)!
@icmlconf (Obligatory ty for gold reviewer award, I prob can't go b/c of logistics but if you're at ICML checkout my lab-mate's awesome work @JennyShen056 )
I'm curious for other ICML reviewers who got gold / silver, what percentage were policy A vs B, what their average scores were, and whether the papers ultimately got in. Any chance of those statistics getting released? @icmlconf
@lateinteraction Very excited for it, y'all are very consistent with the bangers 👀
That is a good point for the RLVR, my brain is probably too deep in the agent rabbit hole where there's alot of PI beyond the final correct answer and the cost for that info scales with environment complexity.
🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them.
We ask: can we use privileged info to *actively sample* the rollouts RL wishes it can stumble upon with compute?
⤵️ Pedagogical RL
@lateinteraction I do want to emphasize I think its good work but I tend to wrinkle my brow when I see privileged information be exposed (even indirectly) to the model due to where my research origins started
@lateinteraction I guess the way I mentally define it, and what I saw in the blog from my quick speed read is 'information the model wouldn't normally have access to'.
My main issue for using this type of information is that in scaled up environments or tasks, it can be costly to obtain.