It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1000+ hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality.
Cog had IOI Gold medalists and top code maintainers Look At The Data: FrontierCode includes 3000+ rubrics covering code quality and anti-cheat checks against the reward hacking plaguing other benchmarks.
FC Diamond is so hard that Opus 4.8 scores 13.8%.
Three eras of AI coding : Three eras of benchmarks
2021 • Autocomplete : HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode
to me the most beautiful chart came when I requested a special historical run across all extant old models: the data showed that the easiest third of FC tasks (in FC Extended) were rapidly and suddenly solved over late 2025 - Opus almost doubled from a 41% pass rate to 74% in 4 months.
This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.
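A quick way to sanity-check the reroll arithmetic, as a rough sketch: it assumes every attempt is independent and succeeds with the single-attempt pass rate, which real agent runs will not exactly satisfy. The function names are mine, not from FrontierCode.

```python
import math


def success_within(pass_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - pass_rate) ** attempts


def attempts_needed(pass_rate: float, target: float) -> int:
    """Smallest number of independent tries whose combined success chance reaches `target`."""
    if not 0.0 < pass_rate < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("pass_rate and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - pass_rate))


if __name__ == "__main__":
    for rate in (0.41, 0.74):  # the two Opus pass rates quoted above
        for n in (2, 6):
            print(f"pass rate {rate:.0%}, {n} attempts: {success_within(rate, n):.1%}")
```

Under that independence assumption, a 41% pass rate needs 6 attempts to clear 95%, while 74% reaches roughly 93% in 2 attempts and passes 98% in 3, so the "2 vs 6" figure reads as a round-number approximation.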
My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....
The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.
Cloudflare's security team spent the last few weeks testing Anthropic's Mythos against fifty of our own repositories. What we learned about offensive AI, why faster patching is the wrong reaction, and what the architecture around vulnerabilities has to look like next. https://t.co/RSrRtIhgaV
We’re training models wrong and it’s due to ChatGPT. Even the modern coding agents used daily still use message-based exchanges: They send messages to users, to themselves (CoT) and to tools, and receive messages in turn.
This bottlenecks even very intelligent agents to a single stream. The models cannot read while writing, cannot act while thinking and cannot think while processing information.
In our new paper, see below, we discuss LLMs with parallel streams. We show that multi-stream LLMs can …
🔵Be created by instruction-tuning for the stream format
🔵Simplify user and tool-use UX, removing many pain points with agents and chat models (such as having to interrupt the model to get a word in)
🔵Multi-Stream LLMs are fast: they can predict+read tokens in all streams in parallel in each forward pass, improving latency
🔵 LLMs with multiple streams have an easier time encoding a separation of concerns, improving security
🔵 LLMs with many internal streams provide a legible form of parallel/cont. reasoning. Even if the main CoT stream is accidentally pressured or too focused on a particular task to voice concerns, other internal streams can subvocalize concerns that would otherwise not be verbalized.
Does this sound related to a recent thinky post :) - Yes, but I don’t feel so bad about being outshipped with such a cool report on their side by 23 hours. I’ll link a 2nd thread below with a more direct comparison. I actually think both are complementary in interesting ways.
Whether it’s existing consulting firms, new ones that emerge, FDEs from agent vendors, or new internal agent engineering roles, the amount of work that is going to be created to implement agents in enterprises will exceed anything we imagine today.
The complexity of implementing agents in any existing organization is very real. When I talk to large enterprises, as you move from a chat paradigm to agents that participate in meaningful workflows, there are a number of things they need to do.
First, you have to get agents to be able to talk to your data securely across your systems. In many cases, enterprises have decades of legacy infrastructure that contains the valuable context for AI agents. That’s going to take a ton of work to go modernize and move to systems that work well with agents.
Then, you need to ensure that you’ve implemented agents with the right access controls and entitlements, the right scopes to be safely used, and have ways of monitoring, logging, and securing the work that they do.
Next, you need to actually document the processes in the organization in a way that agents can utilize for doing the work. You also need to figure out what the new workflow looks like when agents and people are working together on a process, and who steps in where. Just replicating the old workflow will mute the gains. Oh and you likely need to create evals for your top new end-state processes.
Finally, you have to keep up with a rapidly changing set of best practices and architectural shifts happening in the agent space. While it’s fun for people to change their personal productivity tools on a dime, it’s 100X harder to do this in a business process. The speed of change is a blessing and a curse right now for anyone trying to keep a stable system design.
All of this means that individuals and companies that develop expertise on the above set of components (and more) are going to be needed to help organizations actually implement agents at scale. This is also the rationale for vertical AI agents right now that can go in deep on a business domain and help bring automation to it.
This is a huge opportunity right now whether you’re doing this internally or as an external business provider.
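To make the access-control and logging step described above concrete, here is a minimal sketch of one possible shape: every tool call an agent makes goes through a gateway that checks a scope and records an audit entry. `AgentGateway`, the scope strings, and the tool names are all hypothetical illustrations, not any vendor's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable


@dataclass
class AgentGateway:
    """Routes every agent tool call through a scope check and an audit log."""

    agent_id: str
    granted_scopes: frozenset[str]
    audit_log: list[dict[str, Any]] = field(default_factory=list)
    _tools: dict[str, tuple[str, Callable[..., Any]]] = field(default_factory=dict, repr=False)

    def register(self, name: str, required_scope: str, fn: Callable[..., Any]) -> None:
        """Expose a tool to the agent, tagged with the scope it requires."""
        self._tools[name] = (required_scope, fn)

    def call(self, name: str, **kwargs: Any) -> Any:
        required_scope, fn = self._tools[name]
        allowed = required_scope in self.granted_scopes
        # Log before acting so that denied attempts are recorded too.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": self.agent_id,
            "tool": name,
            "scope": required_scope,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{self.agent_id} lacks scope {required_scope!r} for {name}")
        return fn(**kwargs)


if __name__ == "__main__":
    gateway = AgentGateway("invoice-bot", frozenset({"crm:read"}))
    gateway.register("lookup_customer", "crm:read", lambda customer_id: {"id": customer_id})
    gateway.register("delete_customer", "crm:write", lambda customer_id: None)
    print(gateway.call("lookup_customer", customer_id=42))
    try:
        gateway.call("delete_customer", customer_id=42)
    except PermissionError as err:
        print("denied:", err)
    print(len(gateway.audit_log), "audit entries")
```

Keeping the check and the log in one choke point is the design choice here: the agent never holds raw credentials, and the audit trail covers denied attempts as well as successful ones.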
New work with @AlecRad and @DavidDuvenaud:
Have you ever dreamed of talking to someone from the past? Introducing talkie, a 13B model trained only on pre-1931 text.
Vintage models should help us to understand how LMs generalize (e.g., can we teach talkie to code?). Thread:
My dear front-end developers (and anyone who’s interested in the future of interfaces):
I have crawled through depths of hell to bring you, for the foreseeable years, one of the more important foundational pieces of UI engineering (if not in implementation then certainly at least in concept):
Fast, accurate and comprehensive userland text measurement algorithm in pure TypeScript, usable for laying out entire web pages without CSS, bypassing DOM measurements and reflow
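To make the idea concrete, here is a toy sketch of userland layout: once text widths can be measured without touching the DOM, line breaking becomes plain arithmetic. This is not the author's algorithm or API. The per-character width table and the function names are hypothetical, and real measurement has to deal with fonts, kerning, shaping, and Unicode line-break rules. It is written in Python for brevity rather than TypeScript.

```python
# Hypothetical advance widths in pixels; a real measurer would derive these from font data.
NARROW = set("ilj.,:;'!| ")
WIDE = set("mwMW")


def measure(text: str) -> float:
    """Return the width of `text` using the toy per-character widths."""
    return sum(4.0 if ch in NARROW else 12.0 if ch in WIDE else 8.0 for ch in text)


def wrap(text: str, max_width: float) -> list[str]:
    """Greedy line breaking at whitespace, using measured widths only."""
    lines: list[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else f"{current} {word}"
        if current and measure(candidate) > max_width:
            # The next word does not fit: close the current line and start a new one.
            lines.append(current)
            current = word
        else:
            # A single word wider than max_width still gets its own (overflowing) line.
            current = candidate
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    for line in wrap("laying out text without asking the browser to reflow", 160.0):
        print(f"{measure(line):6.1f}px  {line}")
```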
Is it weird that AI coding assistance is not giving me identity fracture?
A lot of software developers are feeling disoriented and threatened these days. Programming by hand is clearly going the way of the buggy whip and the hand-cranked auger. Which is how we're finding out that a lot of people have their identities bound up in being good at hand-coding and how it feels to do that.
That's not me. It's not me at all. Rather to my surprise, I don't miss coding by hand, not any more than I missed writing assembler when compilers ate the world and made that unnecessary. (That was a couple years back, around 1983, for you youngsters.)
Maybe the fact that I'm not feeling any of this disorientation disqualifies me from having anything to say to people who are. On the other hand...if you can learn to emulate my mental stance and be completely unbothered, maybe that would be a good thing?
So. If you're a programmer, and you're feeling disoriented, try this on for size:
I like being a wizard. I like being able to speak spells, to weave complex patterns of logic that make things happen in the world. Writing code is a way to manifest my will.
Yes, I've piled up a lot of arcane knowledge over the 50 years I've been doing this. But languages of invocation, they come and they go. Been a long time since I've had any use for being able to program in 8086 assembler, and that's okay. I have better spells now, and these days some rather powerful familiars.
What I'm inviting you to do is think of yourself as a wizard. Not as a person who writes code, but as a person who is good at assuming the kind of mental states required to bend reality with the application of spells.
And if that's who you are, does it matter if the spells are painstakingly scribed in runes of power, versus being spoken to an obedient machine spirit?
It's all one; it's all the manifestation of will. Arcane languages come and go, machine spirits appear and then diminish to be replaced by more powerful ones, but you? You are the magic-wielder. Without you, none of it happens.
Same as it ever was. Same as it ever was. And so mote it be.
Introducing DeepConf: Deep Think with Confidence
🚀 First method to achieve 99.9% on AIME 2025 with open-source models! Using GPT-OSS-120B even without tools, we reached this almost-perfect accuracy while saving up to 85% generated tokens.
It also delivers many strong advantages for parallel thinking:
🔥 Performance boost: ~10% accuracy across models & datasets
⚡ Ultra-efficient: Up to 85% fewer tokens generated
🔧 Plug & play: Works with ANY existing model - zero training needed (no hyperparameter tuning either!)
⭐ Easy to deploy: Just ~50 lines of code in vLLM (see PR below)
📚 Paper: https://t.co/jnBnRzQczh
🌐 Project: https://t.co/kGq1kATTu0
joint work with: @FuYichao123, xuewei_wang, @tydsh
(see details in the comments below)
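The post does not spell out the mechanism, so what follows is only a rough sketch of one way confidence can drive parallel thinking, assuming each sampled reasoning trace comes with token log-probabilities: score the traces, drop the least confident ones, and take a confidence-weighted vote over the final answers. The `Trace` type, the mean-log-probability score, and the `keep_fraction` parameter are illustrative assumptions, not the paper's definitions or the vLLM PR's API.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trace:
    answer: str                  # final answer extracted from one sampled trace
    token_logprobs: list[float]  # log-probability of each generated token


def confidence(trace: Trace) -> float:
    """Toy confidence score: mean token log-probability (closer to 0 is more confident)."""
    return sum(trace.token_logprobs) / len(trace.token_logprobs)


def confident_vote(traces: list[Trace], keep_fraction: float = 0.5) -> str:
    """Keep the most confident traces, then weight each answer's vote by confidence."""
    ranked = sorted(traces, key=confidence, reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
    votes: dict[str, float] = defaultdict(float)
    for trace in kept:
        # exp(mean logprob) maps the score into (0, 1] so it can act as a vote weight.
        votes[trace.answer] += math.exp(confidence(trace))
    return max(votes, key=votes.get)


if __name__ == "__main__":
    samples = [
        Trace("42", [-0.1, -0.2]),
        Trace("42", [-0.3, -0.1]),
        Trace("17", [-2.0, -1.5]),
        Trace("17", [-1.8, -2.2]),
        Trace("17", [-1.9, -2.5]),
    ]
    print(confident_vote(samples))
```

In this made-up sample a plain majority vote would pick "17", while the confidence-filtered vote picks "42". The sketch covers answer aggregation only; it does not model how tokens are saved during generation.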
how can engineers (or anyone) get better at design?
i get this question a lot. as someone who went from cs to design, here's a real path:
start with systems thinking
engs already get this. design is just systems for humans & our senses, instead of machines. read “thinking in systems” if you haven't.
learn the fundamentals
there are systems we humans evolved over centuries to visually present and receive information. you cannot escape these even for the cli:
- typography — start with “details in typography” by jost hochuli
- color basics — start with “interaction of color” by josef albers
- grid systems — start with “grid systems” by josef müller-brockmann
- visual hierarchy, reading rhythm, symbols & conceptual systems, motion, accessibility, … — you’ll pick more up as you go
open your eyes & brain
look at things around you, digital and natural. observe the beauty and sameness in everything. think about why it is made this way. make connections between what you observe and what you think and make. break away from rigidity, linear thinking, let loose. stare at the sky, do nothing. see through everything.
then just start making stuff
now that you notice things, try to make things better, your way first. redesign apps you use daily. copy designs you love pixel by pixel — you'll learn more in a week than months of theory. then share it with others, get feedback, and design for more people.
key mindset shift: feelings first
stop optimizing for the computer, start optimizing for the human. engineers think in edge cases and error states. designers think in happy paths and emotions. the feelings and how things fold together end up being way more important to humans than the edge cases.
tools don't matter much
Figma is industry standard. learn it in a weekend (it’s basically visual flexbox). use Cursor to dismantle and prototype with existing design systems and study how they are built — frontends go deep.
most important: find your design voice through constraints. pick one great typeface, limited color palette, and make 10 different layouts. constraints breed creativity. iteration is how you get there.
great engineers already understand systems, logic, and problem-solving. you just need to apply that to human concepts and problems instead of technical ones.
start tomorrow. redesign your personal website or a simple app. ship it. share it. repeat.
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇