I do this with Codex all the time. Ask it to review code for bugs and it will tell you it's all good; tell it there is a bug and it will LOOP AND LOOP and will find issues.
BREAKING:
Anthropic just dropped Opus 4.8—and it is a MONSTER
We've been testing for about a week @every and our verdict is they could've just called it Opus 5, it's that good.
Here's our vibe check:
- Beats GPT-5.5 on Senior Engineer bench. On our toughest benchmark Opus 4.8 scores a 63—a hair higher than GPT-5.5's score of 62, and a full 30 points higher than Opus 4.7. It tackled a ground-up rewrite of a production codebase, and actually built something that works.
HOWEVER: Coding performance varied a lot at different reasoning levels. We recommend using it on xhigh for best results.
- Incredibly good writer. Opus 4.8 scored a 79.6 on our writing benchmark—measuring models on real-world writing tasks we do all of the time like essay writing, promo email writing, and more. It beats GPT-5.5 by 6 points. It produces well-written prose with fewer "AI-isms". It's also very good at writing in your voice given the right context.
HOWEVER: Writing performance also varied with reasoning levels. Medium reasoning had a higher incidence of AI-isms; we found the best results with high.
- Beast at knowledge work. Opus 4.8 is very good at general knowledge work tasks like report creation, research and more. It produced the best PowerPoint one-shot we've ever seen on our deck generation benchmark.
- Emotionally intelligent, willing to question the frame. I've also found it to be quite good at talking through psychological or interpersonal issues. It has a high EQ, and it's also good at not glazing and helping to expand your perspective. Its thought process feels extremely rich and dynamic.
THE BAD:
These days a model is only as good as its harness, and Codex is still a far superior harness to the Claude Desktop app. This has kept me using Codex + GPT-5.5 as my daily driver, but I am flipping back and forth a lot more between Codex and Claude.
Anthropic is back baby!
Read the rest on @every:
https://t.co/vuORiDXkxX
Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors.
Available today at the same price.
Erik Voorhees on Venice and the convergence thesis:
"Venice is a private, uncensored version of ChatGPT with crypto ethos baked in."
"DIEM token holders get a dollar per day of free compute, zero marginal cost inference for agents."
"Cryptocurrency was actually just being built for the AI robots. Humans were always awkward with it. Agents won't be."
Currently, the national threat level from #terrorism is SEVERE. This means that #MI5 has indicated that an attack is highly likely. #ukterroralert https://t.co/qMnxYKH2b7
I’ve had a fascinating time building apps with @openclaw since early 2026 and now have one good enough to share: Rebalance Brief, a daily update on finance, crypto and macroeconomic news, including a criticality score of the day’s events, rating the importance of taking note + offering thoughts on what it all means. @claudeai for #ai orchestration + research. Check it out! https://t.co/6LBvdhDsae
"Every software company in the world needs to have an @openclaw strategy" - Jensen at @NVIDIAAI GTC
Framing OpenClaw as one of the most important open source releases ever, NVIDIA has announced NemoClaw - a reference platform for enterprise-grade, secure OpenClaw, with OpenShell, network boundaries, and security baked in.
This is something we've encountered at @_Fan3_ .. shimmering avatar engineers showing up for online interviews with stilted, slightly delayed answers.. 🫤
https://t.co/KFfwPRniKu
Fascinating to watch constantly evolving, competing AI models regularly besting each other & pulling into the lead! Created this comic with @Google @NanoBanana 2.. do I have a future in tech humour? 🤣😅
It's clear that agentic coding tools do unlock bespoke apps tailored to individual needs, but I'm skeptical that this inevitably leads to a world where they dominate over turnkey apps in app stores. Much like how personalised AI creative content breaks shared cultural experiences, highly customised apps break the powerful network effects of app ecosystems. Most successful apps not only serve individual needs but also provide shared experiences and opportunities for connection. I'm not really sure humans are on board for the isolation engendered by a world of disparate, individualised apps, or highly personalised films, books, etc.
Very interested in what the coming era of highly bespoke software might look like.
Example from this morning - I've become a bit loosey-goosey with my cardio recently so I decided to do a more srs, regimented experiment to try to lower my Resting Heart Rate from 50 -> 45 over an experiment duration of 8 weeks. The primary way to do this is to aim for a certain total of Zone 2 cardio minutes plus 1 HIIT session/week.
1 hour later I vibe coded this super custom dashboard for this very specific experiment that shows me how I'm tracking. Claude had to reverse engineer the Woodway treadmill cloud API to pull raw data, process, filter, debug it and create a web UI frontend to track the experiment. It wasn't a fully smooth experience and I had to notice and ask to fix bugs e.g. it screwed up metric vs. imperial system units and it screwed up on the calendar matching up days to dates etc.
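A minimal sketch of the kind of bookkeeping such a dashboard does, assuming hypothetical names throughout (`Session`, `distance_km`, `weekly_zone2_minutes`) and a made-up session format; nothing here reflects the actual Woodway API or the code Claude wrote. It converts units in one explicit place and buckets sessions into experiment weeks with plain date arithmetic, the two places where the bugs mentioned above showed up.

```python
from dataclasses import dataclass
from datetime import date

MILES_TO_KM = 1.609344  # exact definition of the international mile


@dataclass
class Session:
    """One treadmill session (hypothetical record layout)."""
    day: date
    zone2_minutes: float
    distance: float
    unit: str  # "km" or "mi"; carried explicitly so it is never guessed


def distance_km(session: Session) -> float:
    """Normalize a session's distance to kilometres."""
    if session.unit == "km":
        return session.distance
    if session.unit == "mi":
        return session.distance * MILES_TO_KM
    raise ValueError(f"unknown distance unit: {session.unit!r}")


def week_index(start: date, day: date) -> int:
    """0-based experiment week for a calendar day, counted from the start date."""
    offset_days = (day - start).days
    if offset_days < 0:
        raise ValueError("session predates the experiment start")
    return offset_days // 7


def weekly_zone2_minutes(start: date, sessions: list[Session], weeks: int = 8) -> list[float]:
    """Sum Zone 2 minutes per experiment week; sessions past the last week are ignored."""
    totals = [0.0] * weeks
    for s in sessions:
        w = week_index(start, s.day)
        if w < weeks:
            totals[w] += s.zone2_minutes
    return totals


if __name__ == "__main__":
    # Illustrative values only, not real treadmill data.
    start = date(2026, 1, 5)
    demo = [
        Session(date(2026, 1, 5), 40, 3.0, "mi"),
        Session(date(2026, 1, 12), 45, 6.0, "km"),
    ]
    print(weekly_zone2_minutes(start, demo))
    print([round(distance_km(s), 2) for s in demo])
```

Weeks are counted from the experiment's start date rather than from calendar weeks, so day 0 through day 6 always land in week 0 regardless of which weekday the experiment began on.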
But I still feel like the overall direction is clear:
1) There will never be (and shouldn't be) a specific app on the app store for this kind of thing. I shouldn't have to look for, download and use some kind of a "Cardio experiment tracker", when this thing is ~300 lines of code that an LLM agent will give you in seconds. The idea of an "app store" with a long tail of discrete apps you choose from feels somehow wrong and outdated when LLM agents can improvise the app on the spot and just for you.
2) Second, the industry has to reconfigure into a set of services of sensors and actuators with agent native ergonomics. My Woodway treadmill is a sensor - it turns physical state into digital knowledge. It shouldn't maintain some human-readable frontend and my LLM agent shouldn't have to reverse engineer it, it should be an API/CLI easily usable by my agent. I'm a little bit disappointed (and my timelines are correspondingly slower) with how slowly this progression is happening in the industry overall. 99% of products/services still don't have an AI-native CLI yet. 99% of products/services maintain .html/.css docs like I won't immediately look for how to copy paste the whole thing to my agent to get something done. They give you a list of instructions on a webpage to open this or that url and click here or there to do a thing. In 2026. What am I, a computer? You do it. Or have my agent do it.
So anyway today I am impressed that this random thing took 1 hour (it would have been ~10 hours 2 years ago). But what excites me more is thinking through how this really should have been 1 minute tops. What has to be in place so that it would be 1 minute? So that I could simply say "Hi can you help me track my cardio over the next 8 weeks", and after a very brief Q&A the app would be up. The AI would already have a lot of personal context, it would gather the extra needed data, it would reference and search related skill libraries, and maintain all my little apps/automations.
TLDR the "app store" of a set of discrete apps that you choose from is an increasingly outdated concept all by itself. The future is services of AI-native sensors & actuators orchestrated via LLM glue into highly custom, ephemeral apps. It's just not here yet.
A new, unified stack for Base Chain
We're excited to share that we are evolving our technical roadmap toward a unified stack consisting of our own spec, code, and infra, to accelerate the foundation of Base. This shift gives us the autonomy to ship protocol improvements more frequently and focus our resources on scaling to 1 gigagas/s.
What this means for builders:
- Higher Velocity: Targeting 6 hardforks per year to get you new features and fixes faster.
- Massive Scale: Targeting 1 gigagas/s to support high-throughput apps without congestion.
- Extreme Reliability: Targeting 99.99% non-empty blocks and predictable, low fees.
- Simpler Design: A maximally simple spec that’s easier to audit and build on.
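To put the targets above in more concrete terms, here is a rough back-of-the-envelope sketch. The 21,000 gas figure is the intrinsic cost of a plain ETH transfer; the 2-second block time is an assumption about the block interval, not something stated in this post, and the function names are purely illustrative.

```python
GAS_PER_SECOND_TARGET = 1_000_000_000  # 1 gigagas/s, from the post
SIMPLE_TRANSFER_GAS = 21_000           # intrinsic gas of a plain ETH transfer
BLOCK_TIME_SECONDS = 2                 # ASSUMPTION: block interval, not stated in the post
NON_EMPTY_BLOCK_TARGET = 0.9999        # 99.99% non-empty blocks, from the post
HARDFORKS_PER_YEAR = 6                 # from the post
SECONDS_PER_DAY = 86_400


def max_simple_transfers_per_second(
    gas_per_second: int = GAS_PER_SECOND_TARGET,
    gas_per_tx: int = SIMPLE_TRANSFER_GAS,
) -> int:
    """Upper bound on plain transfers per second if all gas went to them."""
    return gas_per_second // gas_per_tx


def empty_blocks_per_day(
    non_empty_target: float = NON_EMPTY_BLOCK_TARGET,
    block_time_seconds: float = BLOCK_TIME_SECONDS,
) -> float:
    """Expected number of empty blocks per day allowed by the target."""
    blocks_per_day = SECONDS_PER_DAY / block_time_seconds
    return blocks_per_day * (1 - non_empty_target)


def months_between_hardforks(per_year: int = HARDFORKS_PER_YEAR) -> float:
    """Average spacing between hardforks, in months."""
    return 12 / per_year


if __name__ == "__main__":
    print(max_simple_transfers_per_second())
    print(round(empty_blocks_per_day(), 2))
    print(months_between_hardforks())
```

Under these assumptions, 1 gigagas/s caps out at roughly 47,619 plain transfers per second, 99.99% non-empty blocks leaves room for about 4 empty blocks per day, and 6 hardforks per year works out to one every 2 months. Real workloads spend far more gas per transaction than a plain transfer, so actual transaction throughput would be lower.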
Along with this, we will take a more active role in managing our own upgrade schedule and stack, allowing us to build what the ecosystem needs, at the speed it needs, while remaining deeply aligned with Ethereum.
Read the full technical breakdown here: https://t.co/5gVnhgh2Q5