We do not need more AI agents. We need more control over them.
That is why I created Invoker, an open-core execution engine for AI-driven engineering workflows.
The bottleneck is no longer just code generation. It is execution control.
AI work needs isolation, replay, auditability, recovery, and human decision points. The model resembles build systems and workflow engines more than it resembles a theoretically “AI-first” chat interface.
https://t.co/JkWx9kdQMH
@antoniogm You didn’t just write a book on what happens to mediocre people who fail upwards, you also managed to turn it into a corporate survival guide in big tech.
Kind of impressive how the people archetypes you’ve described still exist today.
That's not a problem if your hit rate and payoff are good.
Think of it like a market-making operation. You can score consistent wins by taking the easy, low-brow, obvious orders and routing them.
But if you ever hit a black swan, or the free model just wasn't good enough, you take the expensive hit.
The spread between always using Codex and only calling Codex when you have to is your profit margin.
Minus an accounting "fee/reserve/penalty" for inconveniencing the user who says "your response sucks, use Codex to fix it"
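A rough sketch of that spread in Python, assuming a "try the cheap model first, escalate on a miss" policy. The function names, the per-task costs, and the penalty value are all hypothetical, not measurements from any real router.

```python
def expected_cost_with_routing(cheap_cost, expensive_cost, hit_rate, penalty=0.0):
    """Expected cost per task when the cheap model is always tried first.

    On a miss (probability 1 - hit_rate) you also pay for the expensive
    model plus a penalty for inconveniencing the user. All inputs are
    illustrative assumptions.
    """
    miss_rate = 1.0 - hit_rate
    return cheap_cost + miss_rate * (expensive_cost + penalty)


def routing_margin(cheap_cost, expensive_cost, hit_rate, penalty=0.0):
    """Spread between always using the expensive model and routing."""
    routed = expected_cost_with_routing(cheap_cost, expensive_cost, hit_rate, penalty)
    return expensive_cost - routed


if __name__ == "__main__":
    # Hypothetical numbers: cheap call costs 1, expensive call costs 10,
    # the cheap model is good enough 80% of the time, penalty of 2 per miss.
    print(routing_margin(1.0, 10.0, 0.8, penalty=2.0))
```

The margin goes negative once the hit rate drops far enough or the penalty grows, which is the black swan case.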
@FactoryAI
Dynamic model routing products have largely been snake oil so far. We’ve seen many come and go since 2022.
The story of model routing has a simple, legible quality that magnetizes capital.
@Alfred_Lin’s “Beware of Simple Narratives” speaks to the danger of this: https://t.co/STqtEeQP0z
I’m an engineer who has been working on genAI applications since 2022. The nuanced reality is very different from the simple story:
1. As @sqs points out below, frontier models are often better, faster AND cheaper—because they don’t have to retry or get stuck in reasoning loops. The gains of cost-optimized routing are often minimal. Also: people generally want the best possible output. People want to pay 20% more for 5% better.
2. Many projects take a concert of tightly bound models and prompts to complete well. You don’t want individual tasks being routed to different models, as it makes a system unpredictable and unstable. You care about the performance of the aggregate system much more than individual task performance. Dynamic task routing makes it hard to measure the system as a whole.
3. As a user, I dislike how model routing makes software feel opaque. I want to be able to get a “feel” for each model and how to best use it. I don’t want to use a system where changing one word of my prompt might cause me to get routed to a different model, getting wildly different results.
4. Foundation model APIs are already doing model routing to some extent. If there is a significant model arbitrage opportunity which can save costs, they can close the arbitrage themselves.
AI is a paradigm shift that exposes which managers are engaged and which are not. I've seen this firsthand.
Many managers are not good. They are disengaged and neither understand nor can explain the technical or strategic importance of the projects they govern.
In the ZIRP era of remote work, many managers skated by simply by hiding behind empire builders and other disengaged managers.
Most managers manage optics, not results. And organizations are polluted by incentives that reward ignorance of these problems, whether through polite fictions, alliances, or "alignment."
AI is a paradigm shift that tests managers who became managers through politics and hiding problems, as opposed to those who actually have a technical understanding, have faced similar problems, and have the competence to lead teams through the unknown.
When it's time to actually lead and inspire, many managers simply cannot do it. Even though that is their job.
lol sometimes I swear these companies just rip off my blog or something.
It looks like for remote execution, you need durable and persistent workflows! Wait until they find out about hermetic execution and incremental builds.
Temporal, Airflow, and Bazel. You can steal from all 3 of these and basically get the answer. That’s at least what I did.
Self healing? Splitting of tasks? Durable execution?
It sounds like @cursor has been reading the exact same issues I’ve been calling out in my Invoker.
A great cloud agent experience involves a lot more than moving a local agent to a server.
We've learned that it requires a durable execution platform, a powerful harness, and the tools and infra to give agents realistic development environments.
https://t.co/3xb2kGUjFd
I've been thinking about the exact same question: what does workflow look like for a swarm of agents?
My answer is something that looks more like Airflow or Dagster. By doing this you can chain workflows and parallelize tasks like build systems.
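A minimal sketch of that idea, using Python's standard-library graphlib instead of Airflow or Dagster themselves. The task names and the parallel_batches helper are hypothetical; the point is only that a dependency graph tells you which agent tasks can run at the same time.

```python
from graphlib import TopologicalSorter


def parallel_batches(deps):
    """Group tasks into batches that could run in parallel.

    deps maps each task to the set of tasks it depends on, the same
    shape a build system or workflow engine would use.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Every task returned here has all of its dependencies finished.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    # Hypothetical workflow for a swarm of agents.
    workflow = {
        "build": {"checkout"},
        "test": {"build"},
        "lint": set(),
        "deploy": {"test", "lint"},
    }
    for batch in parallel_batches(workflow):
        print(batch)
```

Each batch could be handed to a separate group of agents, and chaining workflows is just adding more edges to the graph.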
Devin's ability to route dynamically based on pricing/tuning seems very interesting. I think the value proposition is actually in that and less on being dependent on an IDE to jump in when needed.
In my initial tests, for long workflows that are broken up, I've cut costs by about 15%.
Let me know if you want to compare notes.
@atelicinvest Claude prices ~$0.30/M cache read, ~$3.75/M cache write, ~$15/M output
I hit 95% cache hits myself. Assuming the rest are output, you get about $94k of cache reads and about $248k of output.
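Here is that arithmetic as a small Python sketch. The total token count is not stated in the thread, so the 330B figure below is an assumption, back-solved from the $94k cache-read number at ~$0.30/M. The cost_split helper is hypothetical too.

```python
def cost_split(total_tokens, cache_hit_rate, read_price_per_m, output_price_per_m):
    """Split spend between cache reads and output tokens.

    Assumes every token that is not a cache read is billed at the
    output rate, as in the estimate above. Prices are dollars per
    million tokens.
    """
    read_tokens = total_tokens * cache_hit_rate
    output_tokens = total_tokens - read_tokens
    read_cost = read_tokens / 1e6 * read_price_per_m
    output_cost = output_tokens / 1e6 * output_price_per_m
    return read_cost, output_cost


if __name__ == "__main__":
    # ASSUMPTION: 330 billion total tokens, chosen so that the cache-read
    # cost lands near the quoted $94k. Not a figure from the thread.
    reads, output = cost_split(330e9, 0.95, 0.30, 15.0)
    print(f"cache reads: ${reads:,.0f}  output: ${output:,.0f}")
```

With those inputs the 5% priced at the output rate costs more than twice the 95% served from cache, which is why the hit rate alone does not tell you the bill.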
I wrote 5 years ago that organizations should be buying well thought out execution. Not code.
This is more true with AI. Tokenmaxxing just revealed that organizations have never thought about developer productivity the right way. They see engineers as cost centers and factory line workers.
This is how you get these ridiculous “number of diffs” metrics.
A more controversial metric I have is velocity per incident.
How many projects did you ship per postmortems/incidents? Or size of all project impact / size of all incident impact.
Why? With AI, experimentation becomes cheap so you can eliminate a lot of near misses you otherwise would have been hurt by.
I don't necessarily agree that vibe coding at scale is the best way of using AI. Coding has never been the hard part of the job.
When you ship more, you naturally create risks and points of failure. Organizations have routinely been reckless with velocity (which leads to this ridiculous talk about shipping) or been so scared of failure that they don't ship at all.
This metric enables AI to remove risks and already leans into pragmatic practices that are routinely denied for the sake of optics.
Finally it is more difficult to game because you’re playing a game of sizing odds in the unknown but knowable. That can only be done properly by leaders who have real battle scars.
You can ship many high impact projects but if you create multiple SEV 1, that’s just being reckless.
Eliminating hidden risks is a tax-free way to grow profits.
A penny saved is a penny earned.
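A minimal sketch of the velocity-per-incident metric in Python, covering both the count version and the impact-weighted version. How to score "impact" and what to do with zero incidents are my assumptions here; returning infinity for a clean record is just one possible choice.

```python
import math


def velocity_per_incident(projects_shipped, incidents):
    """Projects shipped per postmortem/incident.

    ASSUMPTION: with zero incidents the ratio is reported as infinity
    rather than raising an error.
    """
    if incidents == 0:
        return math.inf
    return projects_shipped / incidents


def impact_ratio(project_impacts, incident_impacts):
    """Size of all project impact divided by size of all incident impact.

    Impact units are whatever the organization already uses (dollars,
    user-hours, and so on), as long as both lists share the same unit.
    """
    total_incident = sum(incident_impacts)
    if total_incident == 0:
        return math.inf
    return sum(project_impacts) / total_incident


if __name__ == "__main__":
    # Hypothetical quarter: 12 projects shipped, 3 postmortems.
    print(velocity_per_incident(12, 3))
    # Hypothetical impact scores in the same unit.
    print(impact_ratio([5.0, 3.0], [2.0, 2.0]))
```

Many high-impact projects with multiple SEV 1s drag the second ratio down fast, which is the behavior the metric is meant to punish.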
@jchal I’m not so sure. Yes NYC is competitive but veggies are expensive and perishable. Inventory turnover needs to be fast.
Profitably operating a salad bar would most likely only work at scale?
@wishtcday I heard it's very difficult to do business in China. In the US, if you have a verbal agreement, people will follow through.
In China, you can renege on the entire agreement, no matter what, if financial interests are ever broken, even during the contract.
Is that true?