@ctjlewis It is more complicated than that however: most providers brainfried their models with RL to go prompt -> outcome. Anthropic models can maintain competing hypothesis and tension for far longer, which allows thesis, investigation, correction, and autonomy.
@jonas@dbreunig@karpathy@grok We did Redis last July: https://t.co/B82cjMvQqf. The truth is, specification is the bottleneck (and a well known system shortcuts that).
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use.
Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.
@Yuchenj_UW Verifiable engineering tasks, mostly at the scale of entire features, primarily in compiled languages like go or rust, where it can validate the outputs and hill climb. One such public example: https://t.co/B82cjMviAH with repo of results and all commit/session history.
@_xjdr It natively wants you to be absolutely right (it is Sonnet after all). However, unlike 4.0, it will listen to style and tone instructions on a long horizon basis.
@Yuchenj_UW Verifiable engineering tasks, mostly at the scale of entire features, primarily in compiled languages like go or rust, where it can validate the outputs and hill climb. One such public example: https://t.co/B82cjMviAH with repo of results and all commit/session history.
The world is rapidly bifurcating between those who have experienced the superhuman capabilities of even current generation models, used properly, and those stuck in the capabilities of the past.
We're excited to share that our agent, Maestro, drafted solutions to all 12 problems from ICPC 2025 World Finals in ~2 hours - using current models, no human involvement, no internet access. We deeply respect the human teams' extraordinary dedication. Note: no official validation
@SimonInmania@matthewclifford Yes, task length is in Replit’s case measured in agent runtime, which can slightly proxy for the metric that really matters: how many man months (or years) of work the agent performs. https://t.co/B82cjMviAH is several man years in the course of ~70 hours or agent time however.
@simonw Strong recommendation: it takes a very different promoting strategy to maintain attendance across that context: basic system prompt + context + request alone won’t cut it.
@platonovadim@_xjdr Yes, tests are central including leveraging the existing redis client libraries and benchmarks. The entire commit history also shows every round of the AI workflow during building.
It has become clear there is a massive performance and productivity delta growing between engineers who understand and embrace AI, with appropriate tooling and critical analysis, and those who have have remained in the co-pilot era. Never before has it been so possible for those who know what they are doing, to build so much.
Tired of toy AI demos that fizzle in production? iGentAI built Ferrous: A Rust Redis-compatible server outperforming Valkey. 35KLOC, 100% test passing, beats benchmarks. Zero human code. Built in 70 hours of part-time direction. Toys vs. tools—here's the proof.
Our VibeCodeBench evaluations affirm what @Anthropic just announced: Claude Sonnet 4 excels at autonomous multi-feature development. We've seen codebase navigation errors drop from 20% to near zero and strategic refactoring that saves ~500k tokens on multi stage, complex tasks. Proud to power Maestro with this breakthrough.
It's always great hosting @AITinkerers London meetups right after a new model drops...
Huge thanks to @rebecca_harbeck from @AnthropicAI, as well as the @iGent_AI team @MSzummer and @samshapley for giving impromptu talks with tons of learnings from early access Claude Sonnet 3.7.
We also got to see @HarryCoppock from the @AISecurityInst live demoing 3.7 hacking into a docker container 🫢
And as always, we had some fantastic product, behind-the-scenes and benchmark beating agent talks from Emma Burrows, @moeadham and Sergei Petrov.
Huge thanks to team @localglobevc and @ferdisigona for making it happen!
At @iGent_AI, we’ve found @AnthropicAI new Sonnet 3.7 to be quite the powerhouse. Everything from debugging multi language distributed systems, to comprehending and updating legacy codebases, to rapid prototypes or POCS of new technologies. Agentic SWE is now here.
"Agency > Intelligence"
@karpathy nailed it, and after 18 months building Maestro, we agree. The real AI leap isn’t just smarts—it’s agency: the ability to act independently, turning assistants into partners.