Introducing FrontierSWE, an ultra-long horizon coding benchmark.
We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules.
Despite having 20 hours, they rarely succeed
Opus 4.8 fixes all the issues we observed with previous generations of Opus models.
It is much more token-efficient, better calibrated and it attempts to cheat much less than previous generations. Very impressive release!
Composer 2.5 outperforms all open source models and clearly beats its base model Kimi 2.5 as well as Kimi 2.6. It is roughly on par with, and slightly ahead of, Gemini 3.1 Pro
We still see a large gap between models from Anthropic / OpenAI and other labs
Composer 2.5 is ranked #5 on FrontierSWE
The model is broadly on par with Gemini 3.1 Pro, with a slight edge in our evaluation, and it beats all open source models. We still observe a significant performance gap between Composer and models from Anthropic and OpenAI
This went surprisingly well for our first event - heard great talks and had very interesting conversations about post-training and evals!
A special thanks to our speakers @jyangballin, @rawsh0, @rishiiyer01 and @evan_j_chu, and looking forward to the next one :)
Hosting a research meetup in our North Beach office on Thursday! Come by for food, drinks and talks:
@jyangballin (MSL) will present ProgramBench
@rawsh0 & @rishiiyer01 (Zyphra) will talk about ZAYA-8B
@evan_j_chu and I will speak about FrontierSWE and our research bets!
People from top universities are great on average but nothing gets me more excited than talking to someone who went to a no-name uni (possibly in another country) and ended up at an org with a very high bar
This is a great opportunity for engineers or students who want to get into research and contribute to a high-impact publication! Check out the application form below
https://t.co/zfVW1pbeRE
We are hiring research fellows to help us improve FrontierSWE!
If you want to help build the hardest real-world coding benchmark, reach out! Fellows can work with us for anywhere from a few weeks to months and will be supported with compute and a generous stipend
https://t.co/KL5va5ydAe
Incredible to see how far prime-rl has come!
Initially, decoupling training and inference as much as possible and making asynchronous RL a first-class citizen was done out of necessity to support decentralized training.
Later, it turned out that these happened to be exactly the right design choices for agentic RL with extremely long rollouts. Really cool work from @PrimeIntellect and @RampLabs!
Really interesting work! It's super impressive how many of the agentic coding research artifacts (evals, datasets, harnesses, etc.) the community relies on come from @jyangballin, @OfirPress, @KLieret, @18jeffreyma et al.!
How much of SQLite, FFmpeg, or a PHP compiler can LMs code from scratch, given just an executable and no starter code or internet access?
Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵
GPT-5.5 is the best-performing model on FrontierSWE.
The model substantially outperforms Opus 4.7 in both mean@5 and best@5 rankings while working faster.
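For readers unfamiliar with the two rankings, here is a rough sketch of how mean@5 and best@5 could be computed. The post does not spell out the definitions, so this assumes the common reading: each task is attempted 5 times, mean@5 averages the per-task mean score, and best@5 averages the per-task best score. The function names, task names, and scores below are all made up for illustration.

```python
from statistics import mean

def mean_at_k(scores_per_task):
    # Assumed definition: average the k attempts within each task,
    # then average those per-task means across tasks.
    return mean(mean(runs) for runs in scores_per_task.values())

def best_at_k(scores_per_task):
    # Assumed definition: keep only the best of the k attempts per task,
    # then average those best scores across tasks.
    return mean(max(runs) for runs in scores_per_task.values())

# Hypothetical scores: 2 tasks, 5 attempts each, scored in [0, 1].
example_scores = {
    "task_a": [0.0, 0.5, 0.0, 1.0, 0.5],
    "task_b": [0.0, 0.0, 0.0, 0.5, 0.0],
}

if __name__ == "__main__":
    print("mean@5:", mean_at_k(example_scores))
    print("best@5:", best_at_k(example_scores))
```

Under these assumptions, a model that succeeds rarely but occasionally does very well would rank noticeably higher on best@5 than on mean@5.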