Opus 4.8 dropped yesterday and the timeline did the usual thing — everyone racing to post the win column.
Here's the column nobody's screenshotting.
Anthropic published 7 headline benchmarks. Opus 4.8 wins 6. SWE-bench Pro jumped 64.3 → 69.2. Verified went 87.6 → 88.6. OSWorld nudged to 83.4. Same $5/$25 pricing as 4.7, shipped 41 days after it. Clean upgrade on paper.
Then there's the seventh.
Terminal-Bench 2.1 — raw command-line agent loops — GPT-5.5 still sits on top at 78.2. Opus 4.8 is 74.6. Anthropic moved it up a full 8 points from 4.7 and still didn't catch OpenAI on that one board.
So the honest read isn't "Opus won." It's "Opus won everything except the thing a lot of you do all day."
If your work is multi-file refactors, long autonomous runs, codebase-wide stuff — switch. It's the strongest option right now and it's not close on SWE-bench Pro.
If your work is heavy terminal sequencing — chaining shell, pasting commands, that whole loop — GPT-5.5 is still ahead. Re-pointing your default to 4.8 today might quietly cost you there. Worth knowing before you flip it.
The upgrade I actually care about isn't on the leaderboard though.
4.8 is roughly 4x less likely than 4.7 to let a flaw in its own code slide by unflagged. That's the one that matters when you've got an agent running unattended for an hour — the old failure mode was it telling you the task's done while the code's quietly broken. Less of that now.
honestly that's worth more to me than a point on SWE-bench.
Same price, better at catching itself, loses one board to GPT-5.5. That's the actual picture — not the win column.
Still testing the terminal gap myself. Will report back.
I ran the same 3 prompts on Claude Code and Codex.The numbers broke my brain.
Codex burned 1.64M tokens building a dashboard. Claude did the same dashboard in 283K.
Almost 6x leaner on Claude's side. Same prompt, same output, same desktop app workflow.
So Claude is cheaper, right?
Nope. Claude cost me more.
Here's the thing nobody mentions in these comparison videos — total tokens is a trap. The number that actually drains your session limit is output tokens, and Claude writes way more of them.
Across 3 builds (research PDF, landing page, marketing dashboard):
The dashboard build burned 83K output tokens on Claude versus 18K on Codex. API cost: $11 versus $7.
The landing page — 80K versus 20K.
The research report — 41K versus 16K.
Across all three builds Claude wrote 2-5x more output. Every single time.
Every single build, Claude wrote 2-5x more output than Codex. And output tokens cost more than input tokens — like a lot more on most pricing tables.
This is why people on X have been screaming about Claude Code burning through Max plans in two days. It's not that Opus 4.7 is dumber with context. It's the opposite — it plans aggressively, writes verbose code, adds comments, restructures things twice. Codex just... shuts up and ships.
Honestly? I didn't expect this when I started. I went in assuming "more total tokens = more expensive = hits limits faster." Reverse turned out to be true.
The practical takeaway — if you're hitting Claude Code session limits before 5pm and you've already upgraded to Max 20x, the fix isn't another tier. It's recognizing where the bleed is.
Output tokens.
A few things that helped me stretch sessions further on Opus:
— Tell it explicitly to write concise code. Not "make it short" — "minimize commentary and avoid restructuring unless necessary." — Use Sonnet for execution after planning with Opus. Sonnet writes leaner output for the same task. — Avoid /loop and auto-delegating subagents on small jobs. Each subagent run = full output budget. — Watch the JSONL session log. Claude can read its own logs and tell you exactly where output spiked — I asked it once and got a breakdown in like 30 seconds.
The weird thing is the same pattern shows up in the visual results. Claude's dashboard looked better — dark mode, nicer hovers, gradient bars on the funnel. It's putting that quality somewhere, and "somewhere" is output tokens.
Codex's dashboard worked. It just felt cheaper. Functional sameness, aesthetic distance.
So yeah — the trade is real. You're not picking the cheaper tool, you're picking what you spend tokens on. Claude burns output buying you polish. Codex burns input buying you iterations.
I switched to running planning on Claude and execution on Codex about 3 weeks ago and the session limit anxiety basically went away. Different tools, different jobs. The "which is better" framing was wrong from the start.
Anyway — if your /status is at 12% by Wednesday, check your output token ratio before you upgrade. Might save you $100.
Thanks for reading
THIS GUY BUILT A FULLY LOCAL AI ASSISTANT THAT FITS IN YOUR HAND AND DOESN'T COST A CENT TO RUN
everyone's paying $20/mo for ChatGPT and praying their data isn't being sold somewhere
he just built his own. runs entirely on a raspberry pi, completely offline, 100% private
he calls it Pocket
here's how it works:
1\ you talk to it, fast whisper transcribes your voice locally
2\ a small router model decides if your question is simple or complex
3\ simple stuff goes to qwen non-thinking mode (instant reply)
4\ complex stuff goes to qwen with thinking mode on (slower but smart)
5\ piper tts speaks the answer back to you
the stack:
> raspberry pi 5 (16gb ram) as the main brain
> qwen as the reasoning model with thinking on/off routing
> function gemma 270m for tool calls, fine-tuned on his own dataset
> fast whisper for speech to text
> piper tts medium for text to speech
> hailo 8 hat to run computer vision without bogging down the pi
> 4.3 inch touchscreen with custom ui
> two 18650 batteries for 1-2 hours of untethered use
> 3d printed case with ventilation because the whole thing runs hot
the smart part is the routing. you don't want every "hello, how are you" going through a 19-second thinking process. so simple prompts skip thinking entirely, complex ones get the full reasoning
but the real kicker is the tool calling. function gemma is only 270m parameters which means it loads in milliseconds. he fine-tuned it on his own functions and got it to 100% accuracy on his test set
so now Pocket can pull live weather, search the web, scan your local network for unknown devices, check stock prices, all running locally with one tiny model doing the function selection
he also added scheduled tasks. you can tell it to fetch the weather in atlanta every morning at 7am and it just does it. no cloud cron, no subscription, no api keys
and because there's a camera + the hailo hat, you press one button and it does real-time object detection without touching the language models
if you've been waiting for a personal AI that isn't owned by openai or anthropic or google, this is the blueprint
he open sourced the whole thing. the code, the CAD files, the fine-tuning data
you can build your own this weekend
@SaharaAI@ThisIsJoules@ArjunKalsy the quiet part people miss is agents see what we feed them not the full picture
guess wednesday answers how many are actually giga chain nodes in disguise