With Claude Code / Codex and a management system I built myself, I took a months-long, multi-layered project from start to finish with ๐ฎ๐น๐บ๐ผ๐๐ ๐ป๐ผ ๐ฎ๐บ๐ป๐ฒ๐๐ถ๐ฎ.
The hardest part of a long AI project: every session you re-explain everything from scratch, you fix the same bug over and over, and a few months in you can't even remember why you set things up the way you did. I tried stronger models and longer prompts. Both just treat the symptom, not the cause.
So I rebuilt my whole workspace into a layered system, modeled on OpenClaw's structure. It remembers where the project stands and what the next step is. And it tells me which knowledge has gone stale before I trust it by mistake.
The system and the prompts behind it are all below. Save it, drop it into your next complex project.
The export ban on frontier models won't last long.
Too many incentives are stacked against it:
1โฃ The AI trade is already stretched, and policy risk has just entered the room ($NVDA -18% from the top)
2โฃ OpenAI/Anthropic are IPO'ing. A hard revenue ceiling on non-U.S. access would cut straight into 100s of billions of valuation.
3โฃ Washington canโt afford to kneecap the one narrative still holding up market leadership into midterms
4โฃ Chinese models are catching up anyway. The U.S. lead may be measured in months, not years
5โฃ Trump's oligarchs like Larry Ellison ($ORCL -40% MTD) severely impacted
6โฃ Most of the "sold GPUs" aren't really sold. They are future obligations, based on future hypothetical AI CAPEX (datacenter buildout).
TLDR: banned frontier models have a direct impact on the entire economy
The AI music can't stop without a disaster for the stock market.
๐จ I just open-sourced the Codex skill Iโve been using for months to get (much) better code quality from Codex xhigh.
It creates a supervised dev loopโฟ:
a worker implements, the supervisor reviews observable artifacts, then sends targeted feedback for the next round. All automatic! ๐
no benchmark will tell you this: LLMs can be /too/ nice
unsurprisingly, in a competitive zero-sum setting, being nice can be bad
i built royale: last agent standing, a br for agents, and ran it 30 times
the nicest model lost hard. the model you least expected, won
๐งต:
The SKILL (codex, but applies to claude as well) that has improved my output the most is what I call supervised development.
I spec the feature first, in detail, then I have one model implement it (usually 5.5 xhigh).
Then I bring in a completely fresh model (5.5 high works fine), with no ownership of the code, to review the implementation against the spec.
From there a loop starts that is controlled and observable through artifacts that the two models share (on disk) for each turn.
Introducing the OpenRouter MCP, live model intelligence right inside your agent
Your agent builds and ships, but when it comes to choosing the right model for the right job, it guesses from 6 month old training data
Watch it pick, price, and test the right model:
๐จ SCOOP(s):
- GPT-5.6 has been delayed and will no longer release this week. New target is ~mid-July.
- DeepMind are not satisfied with the current state of 3.5 Pro and it will no longer launch this month.
- Preparations for the launch of Bidi, OpenAI's new voice model, are underway in ChatGPT and we could see it available as soon as this week.
- Claude Sonnet 5 is currently available for select enterprise customers under an Early Access Program and is seen as a stop-gap as progress on getting Mythos/Fable 5 back out have stalled.
A bit of a disappointing end to the month, but July should prove more fruitful!
Interesting benchmark result from today.
Same model, same task suite, different reasoning.effort parameter (for @OpenRouter models).
Grok 4.3 higher reasoning setting scored *worse* than medium.
Not by a huge amount, but enough to look at the cases. ๐
The higher setting improved a few cases where broader analysis helped.
But it lost points in cases that required tighter execution and cleaner decision-making.
My benchmark on 5 semi complex tasks:
- Grok 4.3 scores 62.2 for $0.25
- Xiaomi MiMo 2.5 scores 60 for $0.09
So a 3% performance penalty for 65% of the cost.
More benchmarks and discoveries from using model APIs: some smaller models are much better than their popularity suggests.
Xiaomiโs MiMo, for example, gets little attention because everyone compares everything to subsidized frontier subscriptions ๐
But when you move from subsidized frontiers to APIs, you start to see who brute-forced intelligence with hardware (and money), and who actually engineered efficient models (to overcome hardware scarcity).