@gregisenberg The pricing model is where it breaks down though. I'm building agent products and can't figure out if I should bill per task, per outcome, or just eat the compute cost and hope margins work out.
@AravSrinivas I feel this building agents. One amazing demo run is easy, but getting the floor to where users don't hit a broken tool call every 10th session is where all the real time goes.
@RussBaer @gregisenberg We actually do something similar: our agents append to a decision log that survives session restarts. The hard part isn't recording decisions, it's pruning stale ones so the context window doesn't bloat.
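A minimal sketch of what an append-and-prune decision log could look like. The post doesn't say how "stale" is defined, so this assumes a simple age cutoff; the file name `decision_log.jsonl`, the 7-day window, and all function names are hypothetical.

```python
import json
import time
from pathlib import Path

# Hypothetical file name and age cutoff; neither comes from the original post.
LOG_PATH = Path("decision_log.jsonl")
MAX_AGE_SECONDS = 7 * 24 * 3600


def append_decision(text, path=LOG_PATH, now=None):
    """Append one decision as a JSON line so it survives session restarts."""
    entry = {"ts": time.time() if now is None else now, "text": text}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def load_fresh(path=LOG_PATH, max_age=MAX_AGE_SECONDS, now=None):
    """Return only the decisions newer than max_age, in file order."""
    now = time.time() if now is None else now
    if not path.exists():
        return []
    entries = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return [e for e in entries if now - e["ts"] <= max_age]


def prune(path=LOG_PATH, max_age=MAX_AGE_SECONDS, now=None):
    """Rewrite the log without stale entries; return how many were dropped."""
    if not path.exists():
        return 0
    lines = path.read_text(encoding="utf-8").splitlines()
    total = sum(1 for line in lines if line.strip())
    fresh = load_fresh(path, max_age, now)
    path.write_text("".join(json.dumps(e) + "\n" for e in fresh), encoding="utf-8")
    return total - len(fresh)
```

Age is the crudest possible staleness signal. A real setup would probably also drop decisions that a later entry supersedes, but that needs structure the post doesn't describe.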
GPT-5.6 Sol previewed Jun 26 and I can't even use it yet. OpenAI's shipping it as a limited preview to partners they cleared with the US government first, and that gate matters way more to me than the benchmarks.
@vasuman I've seen this play out in dev tooling too. Claude Code won because it lives in your terminal, not some new app you have to context-switch into.
"Use the newest model" stopped being a strategy the moment one launch gave you three cost-vs-capability points.
Routing per task is the only one left. Your swap layer just became more important than your model pick.
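Roughly what a per-task swap layer could look like, as a sketch. The tier labels reuse the Sol / Terra / Luna names from the launch post; the task kinds and the mapping are assumptions made up for illustration, and the strings are plain labels, not real API model identifiers.

```python
# Hypothetical task kinds mapped to model tiers. The tier names come from the
# launch announcement; the mapping itself is an illustrative assumption.
ROUTES = {
    "complex_reasoning": "sol",   # frontier tier
    "everyday": "terra",          # balanced tier
    "high_volume": "luna",        # fast, cheap tier
}
DEFAULT_TIER = "terra"


def pick_model(task_kind, routes=ROUTES, default=DEFAULT_TIER):
    """Return the tier for a task kind, falling back to the default tier."""
    return routes.get(task_kind, default)


if __name__ == "__main__":
    for kind in ("complex_reasoning", "high_volume", "something_new"):
        print(kind, "->", pick_model(kind))
```

Keeping the mapping in one table is the point: when the next launch reshuffles the tiers, only `ROUTES` changes, not the call sites.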
GPT-5.6 quietly made the version number meaningless, and that's the useful part for builders.
It ships as three models: Sol keeps the GPT-5.5 rate card ($5/$30 per 1M), Terra does roughly GPT-5.5-class work at half the price ($2.50/$15), Luna goes cheaper still.
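To make the rate card concrete, here is a small cost sketch. It assumes the two figures in each pair are input and output prices in USD per 1M tokens, which the post implies but doesn't spell out. Luna is left out because no price is given for it. The function name and the token counts are hypothetical.

```python
# Rates as quoted in the post, read as (input, output) USD per 1M tokens.
# That input/output reading is an assumption; Luna is omitted (no price given).
RATES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
}


def cost_usd(model, input_tokens, output_tokens):
    """Estimate the cost of one workload on the given tier."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 200k input tokens, 20k output tokens.
    for name in RATES:
        print(name, round(cost_usd(name, 200_000, 20_000), 2))
```

For that made-up workload the sketch gives $1.60 on Sol and $0.80 on Terra, which is just the "half the price" claim worked through.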
@steipete Worst part is when your CI goes silent and you spend 30 min debugging before realizing it's a legal agreement. I can't believe they don't even send an email first.
@vasuman I've seen this with dev tools too. Agents that make you learn a new UI get dropped, but the ones plugging into your existing terminal don't even feel like AI anymore.
Introducing a limited preview of GPT-5.6 Sol, our next generation frontier model, as well as GPT-5.6 Terra, a balanced model for efficient, everyday work, and GPT-5.6 Luna, a fast and affordable model for high-volume work.
https://t.co/OoM83SyISN
@gregisenberg I've been building agents and this word hijack kills me. Users say 'ask chat' expecting instant replies, but my stuff chains 5 tool calls and needs 30 seconds to think.
@gregisenberg I've rebuilt my agent stack 3 times in 6 months because the tooling keeps shifting. Distribution is the one that actually compounds; the rest keep resetting.
@daveholtz The legal/recruiting numbers are the real story. I've been building agent workflows and the 'works for devs, breaks for everyone else' gap is brutal, so the skills standardization angle they found gives me some hope.
@_philschmid Biggest thing is it's native to the model, not an orchestration wrapper. Every browser agent I've built hits a wall where tool-call roundtrip latency makes it slower than just doing it yourself.
Computer-use scores quietly converged. Gemini 3.5 Flash just hit 78.4 on OSWorld, basically tied with GPT-5.5 (78.7) and Opus 4.7 (78.0).
So the benchmark isn't the story now. What I actually care about: does it take over my screen, or run in the background while I keep working?
@karpathy The async + persistent part is where the engineering actually hurts. Once Claude lives in the channel instead of a chat window, stale context and "who approved this" auditing stop being edge cases. It demos in a minute, but keeping it "just working" is basically the whole job.
@gregisenberg I've been wiring up something like this for agent memory. The 12-hour vault build is easy, but mine went stale in under a week because nothing piped new context back in automatically.