Gemini Pro beats other models on scientific work, data science, and a bunch of other categories. It dominates so many benchmarks.
On science (94.3% GPQA) and multimodal it's #1 outright. And it's the cheapest of the frontier models.
It's just not good at agentic coding. They probably need to train more on that side. But who knows, they might have actually cooked this time.
Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks.
On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.
Now add two models to co-evolve: one learns to invent questions where hints help the most (finding weak spots) and the other learns to answer like it already had the hint (absorbing the fix).
They push each other harder every round.
What if the way we're training LLMs is the reason they plateau?
Most setups use an LLM to grade another LLM, so the model stops learning quality and learns the grader's taste instead.
This paper kills the grader: https://t.co/Qa3v3eXkFE
It asks the model a question twice: once alone, once with a hint. If the hint changed the answer a lot, you just found a blind spot. That's the whole training signal. No human. No judge. No "right answer".
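A rough sketch of that signal, for intuition only. It assumes "changed the answer a lot" is measured as the KL divergence between the model's answer distribution with the hint and without it; the post doesn't say which measure the paper actually uses. `hint_gap` and the toy distributions are made up for illustration.

```python
import math

def hint_gap(p_with_hint, p_alone, eps=1e-12):
    """KL(p_with_hint || p_alone) over candidate answers (assumed measure)."""
    return sum(
        p * math.log((p + eps) / (p_alone.get(ans, 0.0) + eps))
        for ans, p in p_with_hint.items()
        if p > 0
    )

# Toy answer distributions (hypothetical numbers, not from the paper).
alone = {"A": 0.7, "B": 0.2, "C": 0.1}
hinted_same = {"A": 0.7, "B": 0.2, "C": 0.1}   # hint changed nothing
hinted_flip = {"A": 0.1, "B": 0.8, "C": 0.1}   # hint flipped the answer

gap_same = hint_gap(hinted_same, alone)  # no gap: nothing to learn here
gap_flip = hint_gap(hinted_flip, alone)  # big gap: candidate blind spot
print(gap_same, gap_flip)
```

A big gap needs no grader and no reference answer: it only compares the model against itself.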
By end of year I think 95%+ of agent sessions will come from automations and events. We already see this happening at @cognition, where more than 50% of Devin customer sessions are triggered by non-humans.
Learning how to build these types of systems will be a valuable skill.
In this video I walk through how to get started with event-driven agentic systems with Devin, starting with transforming Slack into an agent-native control plane.
You can extend this to GitHub events, schedules, and arbitrary webhooks while maintaining traceability, auditability, and session attribution with @devinai.
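A minimal sketch of the idea, not Devin's API: every incoming event (Slack, GitHub, schedule, webhook) becomes a session record that remembers what triggered it, so sessions stay attributable and auditable. `Session`, `EventRouter`, and all field names are hypothetical, and the actual call that starts an agent is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import uuid

@dataclass
class Session:
    session_id: str
    source: str        # e.g. "slack", "github", "schedule", "webhook"
    triggered_by: str  # user or automation that caused the event
    prompt: str
    created_at: str

@dataclass
class EventRouter:
    audit_log: list = field(default_factory=list)

    def handle(self, source, triggered_by, prompt):
        """Turn one event into one attributed session record."""
        session = Session(
            session_id=str(uuid.uuid4()),
            source=source,
            triggered_by=triggered_by,
            prompt=prompt,
            created_at=datetime.now(timezone.utc).isoformat(),
        )
        # Append-only log: every session traces back to its trigger.
        self.audit_log.append(session)
        # A real system would start the agent session here (omitted).
        return session

router = EventRouter()
router.handle("slack", "U123", "fix the flaky test in CI")
router.handle("schedule", "nightly-cron", "triage new issues")

# Sessions that no human started by typing a message.
non_human = [s for s in router.audit_log if s.source != "slack"]
print(len(router.audit_log), len(non_human))
```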
Letting AI write tests after it builds a feature is a trap. It just writes tests to validate its own hallucinations.
Ask it to write the tests first and then do the work, and results improve. Claude does the same for SWE-bench tasks.
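A tiny illustration of tests-first, with a made-up `slugify` task: the tests come from the spec before any implementation exists, so they can't just mirror whatever the code happens to do.

```python
import re

# Step 1: tests written from the spec, before the implementation.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  many   spaces ") == "many-spaces"
    assert slugify("") == ""

# Step 2: implementation written to satisfy the tests above.
def slugify(text):
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words)

test_slugify()  # raises AssertionError if the implementation drifts from the spec
```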