The AI cost crisis is real. The diagnosis isn't.
Microsoft just told ~100,000 engineers to drop Claude Code by June 30. Officially: "standardizing on Copilot CLI." Unofficially (per @tomwarren's leaked memo + Fortune): the bills got brutal. Uber burned $3.4B, its 2026 AI budget, in four months. Per-engineer Claude Code spend: $500–$2,000/month.
The frame everyone's using: AI is too expensive.
The frame that's true: most teams don't know how to use it.
Where most of your tokens are actually burning.
You've watched this. The first 15 minutes of a session are sharp. By minute 40 the agent starts looping — tries the same broken approach three times, forgets the spec you handed it at minute 1. By hour 3 you're paying it to confidently invent the wrong fix and ship it as a regression you'll spend tomorrow undoing.
That long-session drift is the cost line item. It's also the regression line item. Hallucination compounds with context decay compounds with steering loss — and you keep paying for tokens that introduce more work than they finish.
The benchmark math is brutal. APEX-Agents tests legal, consulting, and analyst tasks that take humans 1–2 hours. Frontier models that score 90%+ on standard coding benchmarks complete those real-work tasks 24% of the time. After 8 retries: 40%.
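A quick sanity check on those two numbers, as a sketch under one loud assumption: every retry is an independent attempt at the same 24% success rate, and "8 retries" means 8 attempts total. That independence assumption is mine, not APEX-Agents'. If it held, 8 tries would land near 89%. The reported 40% says the retries mostly fail the same way.

```python
# Assumption (not from the benchmark): each attempt succeeds independently
# with the single-pass rate quoted above.
single_pass = 0.24
attempts = 8          # reading "8 retries" as 8 attempts total
observed = 0.40       # the figure quoted above


def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


independent_baseline = pass_at_k(single_pass, attempts)
print(f"if retries were independent: {independent_baseline:.1%}")
print(f"reported after 8 retries:    {observed:.1%}")
```

The gap between the two lines is what correlated failure looks like: rerunning the same agent with the same context tends to hit the same wall.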
The diagnosis was consistent across every model: agents got lost after too many steps, looped back to approaches that had already failed, lost track of what they were supposed to be doing. The steering instructions from step 1 got buried under hundreds of intermediate tool results.
That's not a model problem.
Three receipts that prove a tighter harness wins.
Vercel stripped 80% of their text-to-SQL agent's tools. Accuracy went 80% → 100%. Tokens dropped 40%. 3.5× faster.
Cursor drove tool-call errors down 10× by tuning the harness to the model's actual training format (patch for OpenAI, string-replace for Anthropic). Each retry that doesn't happen is a token you don't pay for.
Manus rebuilt their agent framework five times in six months. The wins came from removing features. Average task: 50 tool calls — long enough that the steering prompt gets evicted before the agent reaches it again.
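What "bounded context" can look like, as a minimal sketch. This is not Manus's or anyone else's actual implementation. `BoundedContext`, `max_results`, and the message layout are hypothetical names for illustration. The idea: the steering prompt is pinned, and only old tool results get dropped.

```python
from dataclasses import dataclass, field


@dataclass
class BoundedContext:
    """Hypothetical context window: pinned steering prompt + recent tool results."""
    steering: str                 # the spec handed over at step 1
    max_results: int = 5          # illustrative budget, not a recommended value
    results: list[str] = field(default_factory=list)

    def add_result(self, text: str) -> None:
        self.results.append(text)
        # Evict the oldest tool results once over budget; steering is never evicted.
        overflow = len(self.results) - self.max_results
        if overflow > 0:
            del self.results[:overflow]

    def render(self) -> list[str]:
        # Steering always comes first, followed by whatever recent results remain.
        return [self.steering, *self.results]


ctx = BoundedContext(steering="Fix the failing test. Do not touch the public API.")
for i in range(50):               # 50 tool calls, the average task quoted above
    ctx.add_result(f"tool result {i}")

window = ctx.render()
print(len(window), window[0])
```

After 50 tool calls the window still holds 6 messages, and the first one is still the spec.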
And the same Opus 4.5 scores 45.9%, 50.2%, or 55.4% on the same SWE-Bench Pro tasks depending on whether it's running on a minimal scaffold, Cursor's harness, or Claude Code's, respectively. Same model, different wrapper. You're paying Opus prices and getting whichever number your scaffolding earned.
@AnthropicAI's own engineering blog confirms the shape: same model + complicated harness was 20× more expensive — but the output quality jump was immediately apparent. Same model. Different scaffolding.
The right question in 2026.
Stop asking which model is best. Ask which harness around a specific model is the best one.
Opus and Sonnet aren't always the answer. Vibe-coding a 50-tool MCP loop around a frontier model and watching it drift for four hours isn't a model problem. It's a discipline problem.
This is the lane I'm building in.
I'm shipping an IDE that authors agent harnesses — Planner / Generator / Evaluator loop, bounded context, a handful of skills instead of 50 MCP tools, verifier loops that catch "made up" before it ships to prod.
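A rough sketch of the Planner / Generator / Evaluator shape with a bounded retry budget. This is not the product's code. `run_harness` and the `toy_*` functions are hypothetical stand-ins, and in a real harness the three callables would wrap model calls.

```python
from typing import Callable, Iterable, List, Tuple


def run_harness(spec: str,
                plan: Callable[[str], Iterable[str]],
                generate: Callable[[str, str], str],
                evaluate: Callable[[str, str], Tuple[bool, str]],
                max_attempts: int = 3) -> List[str]:
    """Plan steps, generate a draft per step, accept only what the evaluator passes."""
    accepted: List[str] = []
    for step in plan(spec):
        feedback = ""
        for _ in range(max_attempts):
            draft = generate(step, feedback)
            ok, feedback = evaluate(step, draft)
            if ok:
                accepted.append(draft)
                break
        else:
            # Budget exhausted: stop and surface the failure instead of looping.
            raise RuntimeError(f"step failed verification: {step!r}")
    return accepted


# Toy stand-ins so the sketch runs without any model behind it.
def toy_plan(spec: str) -> List[str]:
    return [part.strip() for part in spec.split(",")]


def toy_generate(step: str, feedback: str) -> str:
    return f"{step}: done" if feedback else f"{step}: TODO"


def toy_evaluate(step: str, draft: str) -> Tuple[bool, str]:
    if "TODO" in draft:
        return False, "unfinished draft"
    return True, ""


outputs = run_harness("parse input, write tests", toy_plan, toy_generate, toy_evaluate)
print(outputs)
```

The design choice that matters is the `else` branch: when the evaluator keeps rejecting, the loop ends with an error rather than burning tokens on attempt four, five, six.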
If your bills are exploding, the model isn't the issue. It's what's wrapped around it.
Early access: https://t.co/CZj3G4oS5U
source: https://t.co/7N7VIJr7zH
Started building https://t.co/9prFC6I5UG as "agency for agentic workflows." Green field, all that.
Then I tried deploying vanilla agents and hit the wall every Claude Code user knows. They hallucinate. They build skills that drift from your workflow. Give them real tools and they take actions you didn't ask for. The bigger the toolbox, the worse the damage.
The agent isn't the problem: OpenClaw and Hermes both nail the loop. But the agent is not a harness engineer. It doesn't understand how to develop a reliable harness.
So I'm building an IDE for that. Spec → skills → tools → deploy, with the platform showing you what should be a tool vs. a skill, where approval gates go, which context lives where. Right components, right places, right agent.
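One way an approval gate could sit between the agent and its tools, as a sketch only. `Tool`, `call_tool`, and the `approve` callback are hypothetical names, not OpenClaw APIs or the platform's real interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]
    needs_approval: bool = False   # hypothetical flag marking risky actions


def call_tool(tools: Dict[str, Tool], name: str, arg: str,
              approve: Callable[[str, str], bool]) -> str:
    """Run a tool only if it is registered and, when gated, approved."""
    tool = tools.get(name)
    if tool is None:
        # Small toolbox: anything not registered is refused outright.
        return f"refused: unknown tool {name!r}"
    if tool.needs_approval and not approve(name, arg):
        return f"blocked: {name} needs approval"
    return tool.run(arg)


tools = {
    "read_file": Tool("read_file", lambda path: f"contents of {path}"),
    "delete_file": Tool("delete_file", lambda path: f"deleted {path}",
                        needs_approval=True),
}


def deny_all(name: str, arg: str) -> bool:
    # Stand-in for a human reviewer; here every gated request is rejected.
    return False


print(call_tool(tools, "read_file", "notes.txt", deny_all))
print(call_tool(tools, "delete_file", "notes.txt", deny_all))
print(call_tool(tools, "send_email", "hi", deny_all))
```

Reads go straight through, the destructive call waits on a human, and a tool the agent made up never runs at all.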
Built on top of @steipete's OpenClaw.
Early. But already hosting client agents on it.
Also: unfollow anyone telling you their OpenClaw / Hermes agent is making money on its own, running their entire business, or "is their CEO" 😂. That's literally a psyop by big tech to burn as many tokens as possible.
#stoptheslop
Recently unemployed. Decided to build https://t.co/MII258KqUm instead of looking for a job.
Many nights I get anxious wondering how to provide for my family. But I know in my heart that God, who's carried me this far, isn't going to stop now.
https://t.co/vJDejjcW30
@SmokeyTheBera @DefiIgnas Rev should be the only relevant revenue stream for L1 token holders. In fact they are buybacks as staking usually compounds all rewards into the L1 token! DeFi demand for an L1 token also creates a lot of buying pressure but only relevant if the ecosystem’s thriving.
LOVE when @solana is completely down when I try to repay my loan and only works after I got liquidated! @toly already start implementing an auctioning system for MEV or a priority FEE! Beglilion TPS but my TX takes more than 30 min to go through! Worst UX ever!
@MaxBecauseBTC @base Oh boss you are right in time for the party! MR $FINK already set his eyes on the base chain and is ready to pump the RWA narrative $ElonRWA