@thsottiaux Just love this response! I always hate the feedback loop that goes:
Me: “hey there might be a bug”
Recipient: “repro it otherwise you are wrong and I am ignoring you”.
I am keeping my sub just for this response!
When code becomes cheaper, product judgment becomes more valuable.
Agentic engineering cannot replace customer understanding, onboarding, trust, or human adoption.
Product development was never just about coding.
Full essay:
https://t.co/QBXq6ubQl6
The hardest part of making agents self-improve wasn’t getting them to write code.
It was getting failures to survive long enough to be learned from.
The loop is the product: failure -> artifact -> issue -> eval -> fix -> replay -> new evidence.
#AIAgents #AIEvals
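A rough sketch of that loop as code, assuming each failure is tracked as a record that has to pass through every stage in order. The `FailureRecord` class, the `STAGES` list, and the example failure text are all hypothetical, made up for illustration and not taken from any real harness:

```python
from dataclasses import dataclass, field
from typing import List

# Stage names copied from the loop in the post; the ordering is the whole point.
STAGES = ["failure", "artifact", "issue", "eval", "fix", "replay", "new evidence"]


@dataclass
class FailureRecord:
    """One failure tracked through the loop (illustrative structure only)."""
    description: str
    history: List[str] = field(default_factory=lambda: ["failure"])

    @property
    def stage(self) -> str:
        # The current stage is the last one recorded.
        return self.history[-1]

    def advance(self) -> str:
        """Move to the next stage; stages cannot be skipped or repeated."""
        idx = STAGES.index(self.stage)
        if idx == len(STAGES) - 1:
            raise ValueError("loop already complete for this failure")
        self.history.append(STAGES[idx + 1])
        return self.stage

    def is_complete(self) -> bool:
        # Complete only once the failure has produced new evidence.
        return self.stage == STAGES[-1]


# Hypothetical example failure, walked through the full loop.
record = FailureRecord("agent approved a refund without a ticket id")
while not record.is_complete():
    record.advance()
print(" -> ".join(record.history))
```

The design choice here is that a failure only counts as learned from once it reaches "new evidence"; anything stuck at an earlier stage is a failure that did not survive.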
Self-improving agents do not improve from autonomy alone.
They improve when failures leave evidence: broken invariants, approval edges, terminal states, replayable evals.
Otherwise it is still human memory with tools attached.
Full piece:
https://t.co/2CzNmZec08
@thsottiaux @Kappaemme1926 Codex mobile... it's too anchored to “my computer.”
But my docs are in Drive. My repos are on GitHub. My work is already online.
The phone should be enough to say: edit this doc, hunt this bug, monitor a thread, etc.
@thsottiaux I open ChatGPT mostly for Codex now. Codex mobile shouldn’t be a laptop remote. Ideas show up away from the desk, and repos/runners can live in the cloud. Let me turn a thought into a branch, prototype, sim run, or working diff. Hello world, Codex app? 🙂
@thsottiaux Your reset post was the nudge I needed to finally try /fast. I’d been avoiding it because of usage burn, but the reset made it easy to try on real eval/review-agent work. Fresh tokens give me way too much dopamine now 😂
been thinking a lot about “compounding” in agent systems.
the uncomfortable realization: my agents could fix issues once I named them, but they weren’t learning from their own behavior.
so the next loop I’m building is observability → artifacts → issues → evals → fixes.
@claudeai lol I think I accidentally built a version of this into my harness because Claude Code + Opus kept getting too optimistic on complex work. Guess I can delete some code now 😂
The dangerous AI eval result is not a red score.
It is a green score nobody can explain.
If the PM cannot say what behavior the eval measures, green becomes theater.
The team is not learning. It is just watching a dashboard.
Before PMs write evals, they need a real feel for the user job.
The pain. The expectation. The failure mode.
Without that intuition, evals become a black box the team delegates to someone else.
That is when measurement starts drifting away from product judgment.
If AI keeps making workflows, memory, and context easier to transfer, the old SaaS switching-cost playbook gets weaker: data lock-in, setup cost, personalization, etc.
So what creates durable retention?
Maybe it’s trust, built up through repeated delegation.
A colleague made a casual comparison:
AI subscriptions are starting to feel like streaming services.
Gemini, ChatGPT/Codex, Claude as Netflix, Disney+, Apple TV+.
The latest hot show wins the month. The others get cancelled until the next launch 😂
Claude Code can do mobile too, so mobile isn’t the whole story.
The real Codex unlock for me is the combo: no token anxiety, long-running goals, and mobile as the judgment loop.
That changes the feeling from “remote coding” to “ambient product development.”
I still love Claude Code at work.
But Codex is becoming my personal project OS.
The unlock wasn’t just better coding. It was no token anxiety, mobile access, and agents that can keep moving until they need my judgment.