Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
"constructing a multiple-choice question-answering version of the fill-in-the-middle task"
"Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors."
"GooseReason effectively revives models saturated on existing RLVR data"
"GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training"
I make no apology for posting this photo. I think it’s a photo that will haunt America in the years to come.
This is what you have become. If you defend this, have a word with yourself.
Also, super honored to give my first conference talk at NeurIPS 2025 about Terminal Bench, Harbor, and Adapters!
If you are interested in our work or want to gain some context in 15 minutes, this might be a great resource👀
The Terminal-Bench paper is here! Read it to learn where frontier models still fail and the secrets of how we sourced hundreds of high quality environments from our open source community. 🧵
This addresses an important fundamental problem with RL, that is RL training sucks the bits of supervision in all the steps/tokens in the answer through a tiny straw that is the reward scalar.
Textual feedbacks are much more nuanced and informative.
Following the Text Gradient at Scale
We wrote a @StanfordAILab blog post about the limitations of RL methods that learn solely from scalar rewards + a new method that addresses this
Blog: https://t.co/rJ1IcBKDoR
Paper: https://t.co/75pHtElyk3
How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science?
In our new paper, we introduce TheAgentCompany, a benchmark for AI agents on consequential real-world tasks.
I'll get straight to the point.
We trained 2 new models. Like BERT, but modern. ModernBERT.
Not some hypey GenAI thing, but a proper workhorse model, for retrieval, classification, etc. Real practical stuff.
It's much faster, more accurate, longer context, and more useful. 🧵
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source like Llama, Phi, Gemma, and Mistral and leading closed models, including the recent OpenAI GPT-4o and o1-series.
https://t.co/2tv8Pp9MSz
Work done with @i_mirzadeh, @KeivanAlizadeh2, Hooman Shahrokhi, Samy Bengio, @OncelTuzel.
#LLM #Reasoning #Mathematics #AGI #Research #Apple
We put OpenAI o1 to the test against ARC Prize.
Results: both o1 models beat GPT-4o. And o1-preview is on par with Claude 3.5 Sonnet.
Can chain-of-thought scale to AGI? What explains o1's modest scores on ARC-AGI?
Our notes:
https://t.co/sV6LM1foGx
We put OpenAI o1 to the test against ARC Prize.
Results: both o1 models beat GPT-4o. And o1-preview is on par with Claude 3.5 Sonnet.
Can chain-of-thought scale to AGI? What explains o1's modest scores on ARC-AGI?
Our notes:
https://t.co/sV6LM1foGx
ZML: a high-performance AI inference stack that can parallelize and run deep learning systems on lots of different hardware.
It's out of stealth, impressive, and open source.
Another paper pointing out in details what we've known for a while: LLMs (used via prompting) cannot make sense of situations that substantially differ from the situations found in their training data. Which is to say, LLMs do not possess general intelligence to any meaningful degree.
What LLMs can be good for, is to serve as knowledge/routine stores for an actual AGI. They're a memory -- a representation of a data corpus -- and memory is a necessary component of intelligence. But keep in mind that intelligence is not just memory.
Famous European wines like Champagne and Barolo have been built on a marriage of weather, terroir and grape varieties.
In an era of climate change, the rigidity that makes many of these wines unique will make them vulnerable, too:
https://t.co/mqu9WxE5mZ via @opinion
A reminder that China has pledged to peak their emissions by 2030. If they wind up reaching a peak six years early, even if it's a plateau for some time, that's a major development in the effort to peak and reduce global emissions...