We are sharing an early preview of our ongoing SWE-1.6 training run.
It significantly improves upon SWE-1.5 while being post-trained on the same pre-trained model - and it runs equally as fast at 950 tok/s. On SWE-Bench Pro it exceeds top open-source models.
The preview model still exhibits some undesirable behaviors like overthinking and excessive self-verification, which we aim to improve. We are rolling out early access to a small subset of users in Windsurf.
OpenAI is moving away from SWE-Bench Verified, citing challenges on underspecified tasks, misaligned tests, and contamination.
We agree. These were exactly the motivations behind SWE-Bench Pro (https://t.co/ctvRFXzWu6).
What we changed:
→ Underspecified tasks: structured, executable problem definitions
→ Contamination: strict curation + private / commercial codebases
But this is just step one.
Where we’re pushing frontier coding evals next:
→ Beyond unit tests: rubric-based evaluation (https://t.co/9oifP5FQJO)
→ From static tasks to real-world agentic environments
Modern coding systems are not solving isolated problems. They operate as agents over repos, tools, and long-horizon workflows. Our evals need to reflect that.
SWE-Bench Pro is one step toward more realistic and reliable evaluation for coding agents.
We’ll keep pushing the frontier.
The standard for frontier coding evals is changing with model maturity.
We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards.
SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated due to test-design issues and contamination from public repositories.
https://t.co/3GeAsnUHdC
Introducing M2.5, an open-source frontier model designed for real-world productivity.
- SOTA performance at coding (SWE-Bench Verified 80.2%), search (BrowseComp 76.3%), agentic tool-calling (BFCL 76.8%) & office work.
- Optimized for efficient execution, 37% faster at complex tasks.
- At $1 per hour with 100 tps, infinite scaling of long-horizon agents now economically possible
MiniMax Agent: https://t.co/aIzrFYcfUz
API: https://t.co/fHRdSV7BwZ
CodingPlan: https://t.co/FDhZBBjQrX
GPT-5.3-Codex's much better token efficiency *AND* faster inference is the biggest story of this release. Folks at @OpenAI worked hard to improve this and it will only get better from here.
GPT-5.3-Codex is here!
*Best coding performance (57% SWE-Bench Pro, 76% TerminalBench 2.0, 64% OSWorld).
*Mid-task steerability and live updates during tasks.
*Faster! Less than half the tokens of 5.2-Codex for same tasks, and >25% faster per token!
*Good computer use.
This release is an emtional one for me because I had stayed up so much for it 🥹 It has been truly amazing to see this model becomes better bit by bit through every change we make, and we have come a long way.
Since I did mid-training for this model, I wanted to share a little anecdote about this part. We really made this model with user experience as first-class consideration. We want people to actually use it, period. We took it so serious that we redid midtraining because we saw cases where models failed to follow instructions on out-of-distribution scaffolds. We decided straight-up that we would fix this in a fundamental way instead of surface-level patching. The resulting base model, which we also release, is thus a healthy base. We find that, compared to other base models, this one better learns new tasks.
Try fine-tuning our base and lmk what you think 🥳
https://t.co/KSvowSEdTu
🚀New @scale_AI research:
Verifiers for SWE Agents have traditionally used unit tests or simple, execution-free classifiers. But can we get verifiers that are more expressive, repository-grounded, and still execution-free at scoring time?
We explore Agentic Rubrics to fill this gap 💡
Agentic Rubrics are repo-grounded, execution-free verifiers for SWE agents. We generate a checklist of concrete, codebase-specific criteria using an Agentic Harness, and then score patches against it. 🧑💻
🚀New @scale_AI research:
Verifiers for SWE Agents have traditionally used unit tests or simple, execution-free classifiers. But can we get verifiers that are more expressive, repository-grounded, and still execution-free at scoring time?
We explore Agentic Rubrics to fill this gap 💡
Agentic Rubrics are repo-grounded, execution-free verifiers for SWE agents. We generate a checklist of concrete, codebase-specific criteria using an Agentic Harness, and then score patches against it. 🧑💻
Results:
- self-improvement on SWE-bench Verified (+10.4) and Pro (+7.8)
- better than the baseline RL using human issue data over the course of training
New Scale research: Do AI models actually reason in ways humans can trust for real-world decisions?
Introducing MoReBench, the first benchmark for procedural moral reasoning in LLMs, measuring not just what models decide, but how they reason through moral ambiguity.
@scale_AI Check out the paper and dataset:
Paper: https://t.co/2q8KJEicaD
Github: https://t.co/JqshJFKsSb
Dataset: https://t.co/yOW8B2QLfn
Leaderboard: https://t.co/p8bL3cWjZG
New open-source benchmark from @scale_AI: MCP-Atlas
MCP-Atlas is a large-scale benchmark for evaluating tool-use competency using 36 real MCP servers and 220 tools. The benchmark was featured in recent model cards (GPT, Claude, Gemini), and now it's open-source!
🚀 Today we’re open-sourcing MCP Atlas — a large-scale, real-server benchmark for agentic tool use, which has been used in the recent GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash model releases!
🧠 Key insight: realistic agentic tool use is not a function-calling problem. It requires tool discovery, orchestration, and recovery in real environments.
🔧 MCP Atlas evaluates agents on real MCP servers (36 servers, 220 tools, 1K human-written tasks). Models must find the right tools, call them correctly, chain them together, and handle failures.
📉 What we found:
• Agents fail more often at tool interaction than at reasoning
• Performance drops sharply with real-world tool friction
• Scaling models helps unevenly, robustness remains hard
• Claims-based eval reveals how agents fail, not just if they finish
Check it out!
📄 Paper: https://t.co/WlUkPxHbUS
🌍 Environment: https://t.co/48QAhFiiZU
📂 Dataset: https://t.co/eCH9tbPDgm
📊 Leaderboard: https://t.co/dbFPkpqAFk
#AgenticAI #ToolUse #LLMEval #Benchmarks #MCP
We recently introduced MCP-Atlas, a benchmark for evaluating how well LLMs handle tool use via the Model Context Protocol. Even top models failed nearly half of realistic multi-tool tasks.
Today, we’re open-sourcing the benchmark so you can measure performance yourself.
🚀 Today we’re open-sourcing MCP Atlas — a large-scale, real-server benchmark for agentic tool use, which has been used in the recent GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash model releases!
🧠 Key insight: realistic agentic tool use is not a function-calling problem. It requires tool discovery, orchestration, and recovery in real environments.
🔧 MCP Atlas evaluates agents on real MCP servers (36 servers, 220 tools, 1K human-written tasks). Models must find the right tools, call them correctly, chain them together, and handle failures.
📉 What we found:
• Agents fail more often at tool interaction than at reasoning
• Performance drops sharply with real-world tool friction
• Scaling models helps unevenly, robustness remains hard
• Claims-based eval reveals how agents fail, not just if they finish
Check it out!
📄 Paper: https://t.co/WlUkPxHbUS
🌍 Environment: https://t.co/48QAhFiiZU
📂 Dataset: https://t.co/eCH9tbPDgm
📊 Leaderboard: https://t.co/dbFPkpqAFk
#AgenticAI #ToolUse #LLMEval #Benchmarks #MCP