Claude Fable 5 is now available in Devin.
Fable 5 earns the #1 spot on FrontierCode, our benchmark for real-world engineering tasks that grades mergeability and quality:
SWE-bench makes it seem like we already basically solved coding (everyone 50+%)
FrontierCode shows how much room we still have left (nobody beats 13.4%)
FrontierCode is the best attempt I've seen to measure the gap between what AI can produce from a single-shot prompt and what's actually needed to produce maintainable code.
Use it to stop the slop.
I spend most of my time looking at production agent traces, and FrontierCode is the first benchmark that actually comes close to reflecting real software engineering work and success criteria.
Code correctness is only one component of whether an agent's work is useful - the code also has to be maintainable, well-scoped, and well-tested.
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is the first to measure: would you actually merge this code?
Measuring someone's productivity by their token usage is a horrible idea. Giving everyone the same fixed token budget isn't much better. So what's the right way to roll out AI across your org?
We built a system to measure how many productive engineering hours every Devin task is worth, validated against a dataset of real engineers’ time estimates. The goal is to answer the fundamental question that companies are grappling with: how much real value are you getting from each of your agent sessions?
On top of that, we're giving an AI productivity guarantee! Now if Devin delivers less engineering value than you're paying for, we fund your usage until it does.
The whole industry needs to move from measuring activity to measuring output. We hope to see more AI companies taking this approach.
AI should earn its keep. Introducing the AI Productivity Guarantee.
If Devin delivers less engineering value than you’re paying for, Cognition will fund your usage until it does, up to $10 million.
It’s time for the AI industry to stop maximizing tokens and start maximizing productive output.
CTO of Cognition (@stevenkplus1) just joined the leaderboard...
and Devin flipped Claude over a 24-hour period!!!
join the benchmax proof benchmark : ecdsa(.)fail
Looks like @DevinAI is the best research agent out there?
Nikhil from Devin is moving the frontier forward by reducing the number of qubits our circuit requires.
Standalone IDEs have about 6 months left to live. An interface for manually editing and refactoring doesn’t need to exist if you're not manually editing and refactoring anymore.
So what's the right interface for a dev to be working in for 8h / day? Some parts are obvious: you want to be able to spin up agents (either local or cloud agents) and to have a clean interface to keep up with all of your parallel running agents. Then you want to be able to get into the weeds whenever needed for last-mile fixes and review.
But as software engineering continues to evolve we will see more and more of the lifecycle get reinvented. How do you build a single surface that allows you to plan, spec, prototype, debug, review, QA?
Bringing Devin and Windsurf together has been our vision ever since the acquisition. Devin Desktop is our first shot at what this looks like. Excited to make this a reality today!
Introducing Devin Desktop: the next generation of Windsurf
Manage fleets of local and cloud agents from one surface
Support for any ACP-compatible agent
With a full IDE for when you need to jump into the code