What Makes a Base Language Model Suitable for RL?
Rumors in the community say RL (i.e., RLVR) on LLMs is full of “mysteries”:
(1) Is the magic only happening on Qwen + Math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?
(4) Is RL’s calm surface all thanks to Pre/Mid-training carrying the weight?
Why does RL on LLaMA consistently underperform compared to Qwen?
What makes a base model truly ready for RL scaling?
What are the secrets under the hood?
Due to the cost of training from scratch, we conduct extensive controlled experiments with 20B-token mid-training, systematically investigating what really matters for RL success.
💡Key insights:
- High-quality math data is key to RL scaling.
- QA data helps, but it depends on task similarity.
- Instruction data boosts QA’s effectiveness.
- More mid-training improves RL performance.
Armed with these insights, we apply a two-stage (stable+decay) mid-training strategy on LLaMA, scaling up to 200B tokens—and RL performance on LLaMA now matches Qwen!
To support this, we introduce MegaMath-Web-Pro-Max, a high-quality math-centric pretraining corpus. The dataset will be released soon on Hugging Face—stay tuned! 📦 https://t.co/2nKPTS3r7B
Full construction details are in the paper, we hope it’s useful! https://t.co/z9uHAwVnJO
Getting SOTA with a strong foundation is great 🤩,
but understanding the foundation—the know-how—matters just as much.
Hope this analysis inspires the community—and feel free to cite us if it helps!
This work is impossible without all the brilliant co-authors @FaZhou_998@xuefengli0301@stefan_fee !!!
Cool paper by AllenAI and CMU on how to pretrain LLMs for long-context understanding
The setup is legit and the findings are interesting, and they show that most LLMs have architectural issues 🧵
there is no better time in tech than now to be a jack of all trades, master of a few.
just make sure to keep adding to the few year over year, such that the cumulative breadth of expertise you collect becomes an increasingly rare combo. remember, if you're top 10% in 3 different areas, that already makes you top 0.1%. keep switching it up until you get to "your best", and then switch it up again (great for a particular flavor of people who don't enjoy resting on laurels, maybe not so great for others).
question all institutional value and pedigrees, all traditional career paths or corporate ladders: the college industrial complex is getting shaken up, alongside a disappearing managerial class, so if you're pursuing either make sure you are fully internally aligned with why. social/political capital in a particular institution can feel incredible, but if you're spending all your energy on complex political people games, you're not a technologist anymore, you're an unelected politician. if you're ok with that, then all's well.
critical thinking is more important than ever: take nothing at face-value, question everything and everyone. the equivalent of ai slop can be found in humans operating under misaligned incentives and interests. the sooner you're clued into disambiguating the talkers/larpers from the doers, the better off you'll be figuring out where and who to invest your time in.
the anxiety of job displacement is very real, since a surprising amount of white collar work/prestige is built on a performative house of cards, significantly lacking in correlation with technical breadth, depth, and skill. as long as you keep learning, keep building, keep producing receipts, you will be fine.
if all that sounds ok to you, welcome to the world of technology! it's truly one of the few places you can experience child-like wonder every few years, and be constantly humbled & excited by new adventures, as scary as they may seem at first.
don't give up, drink your water, get your sunlight, and take breaks as needed. tech careers are notoriously nonlinear, so you might as well embrace it and enjoy the ride!
We released physics-intern: a simple harness for science problems!
It gets models like Gemini 3.1 Pro to go from 17.7 -> 31.4, thus beating GPT 5.5 Pro.
The physics-intern harness can wrap any model and via dedicated subagent boost the performance of the vanilla reasoning models.
While I think more and more of these harness capability gains will be absorbed into the models (like prompting tricks disappeared over time) there is a lot to be gained right now by building good scaffolds for those models and integrating tools well.
Interestingly, the exception we found that GPT 5.5 Pro actually didn't benefit from the physics-intern harness!
Read more about it here: https://t.co/x74RAYCt5B
PS: I think the Harness[Model] notation is kind of nice.
this is fascinating, they train an encoder/decoder but use LLM matching the target model's shape for each part, so the latent space is just plain language and they can detect reward hacking, unwanted behavior and more
could even see it being used as an eval to quantify how smart a model is, i love this
OpenAI introduces an additional layer of defense against misaligned or confused coding agents, complementing chain of thought monitoring we use internally. When Codex wants to execute a risky action outside of its sandbox, a separate Codex agent is asked to approve or deny it.
It's time to change my work philosophy after witnessing the remarkable productivity of the frontier flagship model.
Past: I needed to prepare many useful tools for myself to improve efficiency in my workspace.
Now, I need to create a seamless workspace for AI, enabling it to interact with various tools, engines, and access permissions. It could boost my efficiency by at least 2 to 5 times. More importantly, the saved time allows me to think in a global perspective, waiving many manual labour and having a coffee by the way.
While I recognize that this change might be a bit late, I’m glad to know it’s still not too late to adapt.
🚀🚀🚀
The Arcee AI Podcast is here!
In this episode, @latkins and @stochasticchasm join us to discuss the story of Trinity models and everything frontier. I can say, this talk has been one of the most amazing and technical conversations we've had on Ground Zero.
0:00:00 - Intro
0:00:59 - Varun's transition from SWE to Pre-Training Lead
0:04:20 - Trinity Manifesto, Openclaw Ecosystem 0:12:15 - Arcee's Post-Training to Pre-Training Pivot 0:23:45 - Varun's first Pre-Training run (you can just do things!)
0:27:33 - Saturation in Pre-Training?, Mid-Training 0:37:00 - Tweaking the Training Architecture, Adam vs Muon, Evals
01:09:07 - Inference Engineering, Quick Fire, Post-Training Recipe
01:18:02 - Alpha in RL Envs, Harness Design
01:23:00 - American Open Source is trailing Chinese Competitors, Trinity Adoption
01:29:25 - Hiring at Arcee, Advice to 20yo
Aha, thank you for the kind words! We’re exploring what “frontier lab” means in academia—through democratizing cognition and embracing “less is more” & “simple is powerful”.
Recent releases:
- agentic intelligence: davinci-dev, davinci-agency, davinci-env
- open foundation model: davinci-llm, davinci-magihuman
- data efficiency: (lima) limo, limr, limi
- benchmark: agencybench, researcherbench, innovatorbench ...
- data darwinism PartI, Part II
- interaction as Intelligence: Part I, Part II
- engineering: prompt engineering, cognition engineering, context engineering 2.0
More at:
https://t.co/x4Nw4qohCX
Our North Star: Using AI technology to make life better for people around us.
Would love to exchange ideas if any of these interest you!
Seedance 2.0 is impressive. But it's closed-source!
Introducing our daVinci-MagiHuman — a single-stream 15B Transformer trained from scratch that jointly generates video + audio. No cross-attention. No multi-stream branches. Just self-attention.
⚡ 5s 1080p video in 38s on a single H100
🏆 80% win rate vs Ovi 1.1 | 60.9% vs LTX 2.3 (2,000 human comparisons)
🌍 6 languages
📦 Fully open-source
Speed by simplicity.
By @SII_GAIR × @SandAI_HQ
📄 https://t.co/SgFOunlEIj
💻 https://t.co/9rwNWzlMKN
🤗 https://t.co/txduP5FgIC