Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.
Over the past month, some of you reported Claude Code's quality had slipped. We investigated, and published a post-mortem on the three issues we found.
All are fixed in v2.1.116+ and we’ve reset usage limits for all subscribers.
@claudeai Is chat compaction on the Desktop app just permanently deleting messages? 🥴 I lost a lot of context I'd built up in a conversation. Is this expected behavior? It's a significant issue for my usage... Who's the right person to report this to?
cc: @bcherny @trq212
GPT-5.4 loses 54% of its retrieval accuracy going from 256K to 1M tokens. Opus 4.6 loses 15%.
Every major AI lab now claims a 1 million token context window. GPT-5.4 launched eight days ago with 1M. Gemini 3.1 Pro has had it. But the number on the spec sheet and the number that actually works are two very different things.
This chart uses MRCR v2, OpenAI’s own benchmark. It hides 8 identical pieces of information across a massive conversation and asks the model to find a specific one. Basically a stress test for “can you actually find what you need in 750,000 words of text.”
At 256K tokens, the models are close enough. Opus 4.6 scores 91.9%, Sonnet 4.6 hits 90.6%, GPT-5.4 sits at 79.3% (averaged across 128K to 256K, per the chart footnote). Scale to 1M and the curves blow apart. GPT-5.4 drops to 36.6%, finding the right answer about one in three times. Gemini 3.1 Pro falls to 25.9%. Opus 4.6 holds at 78.3%.
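A quick sanity check on the headline percentages, as a minimal sketch: it assumes the "loses 54%" and "loses 15%" figures are relative drops computed from the scores quoted above. The helper name `relative_drop` is made up for illustration.

```python
def relative_drop(before: float, after: float) -> float:
    """Fraction of the starting score that is lost, e.g. 0.5 means half."""
    return (before - after) / before

# Scores as quoted in the post: (score at 256K, score at 1M), in percent.
scores = {
    "GPT-5.4": (79.3, 36.6),
    "Opus 4.6": (91.9, 78.3),
}

drops = {name: relative_drop(b, a) for name, (b, a) in scores.items()}

for name, drop in drops.items():
    print(f"{name}: loses {drop:.0%} of its 256K accuracy at 1M")
```

Gemini 3.1 Pro is left out because the post only gives its 1M score, not its 256K score.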
Researchers call this “context rot.” Chroma tested 18 frontier models in 2025 and found every single one got worse as input length increased. Most models decay exponentially. Opus barely bends.
Then there’s the pricing. Today’s announcement removes the long-context premium entirely. A 900K-token Opus 4.6 request now costs the same per-token rate as a 9K request, $5/$25 per million tokens. GPT-5.4 still charges 2x input and 1.5x output for anything over 272K tokens. So you pay more for a model that retrieves correctly about a third of the time at full context.
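To make the pricing point concrete, here is a rough sketch of the arithmetic. The $5/$25 rates come from the post; the 1,000 output tokens, the function name `request_cost`, and the 1.0/1.0 placeholder base rates in the premium example are assumptions for illustration only, not any provider's real prices. It also assumes the premium multipliers apply to the whole request once the input passes the threshold, which the post does not spell out.

```python
def request_cost(input_tokens, output_tokens, in_rate, out_rate,
                 threshold=None, in_mult=1.0, out_mult=1.0):
    """Dollar cost of one request; rates are dollars per million tokens."""
    # Assumption: past the threshold, the multipliers apply to the whole
    # request, not only to the tokens beyond the threshold.
    if threshold is not None and input_tokens > threshold:
        in_rate *= in_mult
        out_rate *= out_mult
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Flat pricing at $5 / $25 per million tokens (rates from the post).
small = request_cost(9_000, 1_000, 5.0, 25.0)
large = request_cost(900_000, 1_000, 5.0, 25.0)

# Long-context premium (2x input, 1.5x output over 272K), shown with
# placeholder base rates of 1.0 / 1.0 so only the multiplier effect is visible.
flat = request_cost(900_000, 1_000, 1.0, 1.0)
premium = request_cost(900_000, 1_000, 1.0, 1.0,
                       threshold=272_000, in_mult=2.0, out_mult=1.5)

print(f"9K request:   ${small:.3f}")
print(f"900K request: ${large:.3f}")
print(f"premium / flat at 900K: {premium / flat:.2f}x")
```

With flat pricing the 900K request costs more only because it has more tokens; the per-token rate is identical. With the premium, the same request costs roughly twice the flat price, since input tokens dominate the bill.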
For anyone building agents that run for hours, processing legal docs across hundreds of pages, or loading entire codebases into one session, the only number that matters is whether the model can actually find what you put in. At 1M tokens, that gap between these models just got very wide.
A lot of people quote tweeted this as the 1 year anniversary of vibe coding. Some retrospective -
I've had a Twitter account for 17 years now (omg) and I still can't predict my tweet engagement basically at all. This was a shower of thoughts throwaway tweet that I just fired off without thinking but somehow it minted a fitting name at the right moment for something that a lot of people were feeling at the same time, so here we are: vibe coding is now mentioned on my Wikipedia as a major memetic "contribution" and even its article is longer. lol
The one thing I'd add is that at the time, LLM capability was low enough that you'd mostly use vibe coding for fun throwaway projects, demos and explorations. It was good fun and it almost worked. Today (1 year later), programming via LLM agents is increasingly becoming a default workflow for professionals, except with more oversight and scrutiny. The goal is to claim the leverage from the use of agents but without any compromise on the quality of the software. Many people have tried to come up with a better name for this to differentiate it from vibe coding; personally, my current favorite is "agentic engineering":
- "agentic" because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight.
- "engineering" to emphasize that there is an art & science and expertise to it. It's something you can learn and become better at, with its own depth of a different kind.
In 2026, we're likely to see continued improvements on both the model layer and the new agent layer. I feel excited about the product of the two and another year of progress.