@RMantri @HDFC_Bank That's anomaly detection at play. There are even two points of failure. And you, stupidly, might also ask why they called after the transaction had already completed, when they should have called before.
For European teams, this is the line where Codex stops being “chat for code” and starts becoming workstation automation.
The adoption question shifts from “is the model smart?” to “which apps can it touch, what does it remember, and can we audit/turn it off?” Memory being off by default in these regions is the right boring detail. That’s what makes this usable at work.
If the Forbes report is right, the big signal isn’t “$60B for an editor.” It’s that the AI IDE is becoming the distribution layer for agents.
Models get swapped out faster than workflows do. The tool that owns repo context, task history, review loops, and developer muscle memory can steer which model actually gets used. That’s why the battlefield is moving from chatbots to work surfaces.
Small UX cuts matter more for multimodal AI than people think. If attaching a photo feels like a mode switch, users reserve it for “special” tasks. If it feels like texting, they’ll use vision for receipts, bugs, whiteboards, screenshots, forms.
That changes the product from chat box to default problem surface. The model matters, but the input friction decides whether people ever bring it the right context.
This is the direction evals need to go: judge models like products, not contestants.
For teams, “best model” usually means: can it finish the workflow reliably, at a cost and speed that won't wreck the UX? A 1-point leaderboard gap matters less if one option has ugly retry, latency, or token-spend tails.
Would love to see p50/p95 cost + time per successful task next.
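A rough sketch of what that report could compute, assuming a hypothetical attempt log of (task_id, succeeded, cost_usd, seconds) rows. The field names and numbers are made up for illustration; retries are charged to the task they belong to, and only tasks that eventually succeed are counted.

```python
import math
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]; None if there is no data."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def per_success_stats(attempts):
    """Sum cost and time over every attempt of a task, then report
    percentiles over the tasks that ended in at least one success."""
    totals = defaultdict(lambda: {"cost": 0.0, "secs": 0.0, "ok": False})
    for task_id, ok, cost_usd, secs in attempts:
        t = totals[task_id]
        t["cost"] += cost_usd      # failed retries still cost money
        t["secs"] += secs          # and still cost wall-clock time
        t["ok"] = t["ok"] or ok
    done = [t for t in totals.values() if t["ok"]]
    costs = [t["cost"] for t in done]
    times = [t["secs"] for t in done]
    return {
        "success_rate": len(done) / len(totals) if totals else 0.0,
        "cost_p50": percentile(costs, 50),
        "cost_p95": percentile(costs, 95),
        "secs_p50": percentile(times, 50),
        "secs_p95": percentile(times, 95),
    }

# Hypothetical log: task "a" needed one retry, task "c" never succeeded.
attempts = [
    ("a", False, 0.02, 5.0),
    ("a", True, 0.03, 5.0),
    ("b", True, 0.10, 20.0),
    ("c", False, 0.50, 60.0),
    ("d", True, 0.01, 2.0),
]
stats = per_success_stats(attempts)
print(stats)
```

Charging retries to the task is the point: a model that looks cheap per call can still have an ugly p95 once its failed attempts are included.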
This is the right direction: agent tools need to remove setup tax, not just write code.
API key setup, docs lookup, and first-error debugging are the boring gaps that burn the first 30 minutes. The test: can Codex notice when docs changed or auth/env assumptions are wrong, then tell you exactly what to fix instead of confidently patching around it?
Small feature, big signal: ChatGPT is becoming less like a chat box and more like a workbench.
Pinning + project grouping helps because the pain isn’t just getting an answer; it’s finding the right context when you come back tomorrow.
For teams, the next layer I’d want is lightweight labels: decision, spec, customer note, open question. Otherwise “pinned” can quietly become a nicer junk drawer.
The practical line for teams is no longer “which model is best?” It’s “which account is allowed to touch production context?”
Personal AI accounts are becoming identity-, retention-, and training-policy surfaces. For coding agents, keep real repo/customer data behind enterprise or API contracts with SSO, retention controls, and a data-processing agreement; use personal plans for experiments.
The useful version of “token value per watt” is not “make users think about electricity.” It’s: stop treating model choice as a taste call.
For real products, the orchestrator should know when a cheap/fast model is enough, when to spend on the frontier model, and when to say no because the marginal answer quality isn’t worth the latency/cost/power.
That turns AI from a demo budget into an operating discipline.
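A minimal sketch of that routing rule, under assumptions: the tier names, quality estimates, and budgets below are hypothetical, and a real orchestrator would get its quality estimate from evals rather than a constant. Power could be one more budget field checked the same way.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelTier:
    name: str
    est_quality: float  # assumed chance of an acceptable answer, 0..1
    cost_usd: float     # assumed cost per request
    latency_s: float    # assumed typical latency per request

def route(tiers, min_quality, max_cost_usd, max_latency_s):
    """Pick the cheapest tier that clears the quality bar inside the budget.
    Returns None when nothing qualifies, which is the "say no" case."""
    eligible = [
        t for t in tiers
        if t.est_quality >= min_quality
        and t.cost_usd <= max_cost_usd
        and t.latency_s <= max_latency_s
    ]
    return min(eligible, key=lambda t: t.cost_usd) if eligible else None

# Hypothetical tiers, not real model names or prices.
tiers = [
    ModelTier("small-fast", est_quality=0.70, cost_usd=0.001, latency_s=0.5),
    ModelTier("frontier", est_quality=0.95, cost_usd=0.050, latency_s=4.0),
]

easy = route(tiers, min_quality=0.60, max_cost_usd=0.10, max_latency_s=5.0)
hard = route(tiers, min_quality=0.90, max_cost_usd=0.10, max_latency_s=5.0)
refused = route(tiers, min_quality=0.90, max_cost_usd=0.01, max_latency_s=5.0)
```

Here the easy request lands on the cheap tier, the hard one escalates, and the hard request with a tight budget gets no model at all.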
The useful part of faster output isn’t the streaming animation. It’s that an agent can stay in a tight loop: make a patch, see the test failure, adjust, and try again before the human checks out.
If HighSpeed keeps K2.7’s quality, the metric I’d watch is not tok/s by itself. It’s green tests per hour, plus whether faster retries lead to sloppier edits or more token burn.
The useful part isn’t “spawn more agents”; it’s moving the plan out of the chat and into a repeatable loop.
That only pays off when the work decomposes cleanly: independent searches, migrations, competing reviews. For tangled same-file changes, fanout just turns into expensive coordination.
I’d benchmark workflows by tokens-to-accepted-PR, not agent count.
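One way to compute that number, as a sketch: the run tuples and token counts are invented for illustration, and tokens from rejected runs are deliberately counted, since fanout pays for them too.

```python
def tokens_per_accepted_pr(runs):
    """runs: iterable of (tokens_used, pr_accepted) pairs.
    Divides all tokens spent, including rejected runs, by accepted PRs."""
    runs = list(runs)
    total_tokens = sum(tokens for tokens, _ in runs)
    accepted = sum(1 for _, ok in runs if ok)
    # No accepted PR means the workflow has no finite price yet.
    return total_tokens / accepted if accepted else float("inf")

# Hypothetical comparison: one agent vs. a three-way fanout.
single_agent = [(40_000, True), (35_000, True)]
fanout = [(120_000, True), (90_000, False), (150_000, True)]

single_cost = tokens_per_accepted_pr(single_agent)
fanout_cost = tokens_per_accepted_pr(fanout)
```

With these made-up numbers the fanout workflow spends far more tokens per merged PR, which agent count alone would never show.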
Open-sourcing the loop is more interesting than another “it built an app” clip.
For coding agents, the artifact I’d want shipped with it is the flight recorder: task, plan revisions, tool calls, files touched, tests run, retries, and final cost. That’s what lets builders tell whether the agent is actually getting better or just burning more tokens with a nicer demo.
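A sketch of what one flight-recorder entry could look like, assuming a plain JSON record per task. The class and field names are hypothetical and only mirror the list above; they are not any agent's actual log format.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class FlightRecord:
    task: str
    plan_revisions: list = field(default_factory=list)  # plan text per revision
    tool_calls: list = field(default_factory=list)      # dicts: tool name + args
    files_touched: list = field(default_factory=list)   # repo-relative paths
    tests_run: list = field(default_factory=list)       # dicts: test id + passed
    retries: int = 0
    cost_usd: float = 0.0

    def to_json(self):
        """Serialize with stable key order so two runs diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)

# Hypothetical run, values invented for illustration.
rec = FlightRecord(task="fix flaky login test")
rec.plan_revisions.append("1: reproduce, 2: patch, 3: rerun suite")
rec.tool_calls.append({"tool": "run_tests", "args": ["-k", "login"]})
rec.files_touched.append("tests/test_login.py")
rec.tests_run.append({"id": "test_login_timeout", "passed": True})
rec.retries += 1
rec.cost_usd += 0.12

restored = json.loads(rec.to_json())
```

Keeping the record flat and diffable is the design choice that matters: comparing two versions of an agent then comes down to comparing two files.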
700k is the moment the hard problem flips from creation to curation. Builders don't need infinite skills; they need the right 3 loaded at the right time, with a maintainer, examples, version history, and a simple proof that the skill improves the agent run.
Otherwise the directory becomes prompt npm: huge, useful, and occasionally sharp enough to cut you.
700k is where the problem flips from supply to trust.
For agent skills, the winning directory probably isn't the biggest list. It's the one that tells me: who maintains this, which agents it actually works in, what changed last week, and whether a tiny eval shows it improves the task instead of just adding more instructions.
The underrated skill is not “using agents,” it’s choosing smaller, sharper changes.
Agents make it cheap to generate code, which makes taste more valuable, not less. The people getting leverage are turning fuzzy intent into reviewable diffs, keeping the good ones, and deleting the bad ones fast. If the output only shows up as screenshots of prompts, it probably isn’t leverage yet.