If your AI agent messes up, you're the one taking responsibility.
I was working on a client project today. Got a bug report and instead of recreating the issue, I thought I'd restore the production database locally to see if I could fix it.
I told my Claude agent to back up the production database and restore it locally. I saw it was planning things out, so I stepped away from the desk for a bit. It should have been a 5 minute task max, but when I came back it was still running and had somehow kicked off 10+ pg_dump processes.
Long story short, it corrupted the production database. The only relief is that this is a relatively new platform and we didn't have any real user data yet. I should have set up a backup strategy though.
Lesson learned.
unpopular opinion: picking the single best model matters way less than picking the right model for each task. we route 80% of agent work through smaller models and only escalate to frontier models when reasoning actually matters. saved 60% on costs with maybe 5% quality drop. the "which model is best" debate misses the point entirely.
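A minimal sketch of what per-task routing could look like, assuming tasks carry a simple `kind` label. The model names, the `Task` type, and the set of "reasoning" task kinds are all hypothetical placeholders, not anything from the post.

```python
from dataclasses import dataclass

# Hypothetical model tiers; these are placeholders, not real model IDs.
SMALL_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

# Assumed set of task kinds where reasoning quality justifies escalation.
REASONING_TASKS = {"planning", "debugging", "architecture"}


@dataclass
class Task:
    kind: str
    prompt: str


def pick_model(task: Task) -> str:
    """Default to the small model; escalate only for reasoning-heavy kinds."""
    if task.kind in REASONING_TASKS:
        return FRONTIER_MODEL
    return SMALL_MODEL


if __name__ == "__main__":
    for task in (Task("summarize", "tl;dr this thread"), Task("planning", "plan the migration")):
        print(task.kind, "->", pick_model(task))
```

A static lookup like this is the simplest possible router. In practice the escalation rule might be a classifier or a retry-on-low-confidence step, but the shape stays the same: cheap by default, expensive by exception.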
the products that win in the agent era won't be chat-only or gui-only. they'll be the ones where your ai assistant and the visual interface share the same real-time state. most tools aren't even close to this yet.
I've spent a lot of time thinking about interfaces recently. What is the value of a great user interface in a world with AI-assisted and agent-driven work? Since ChatGPT, Claude and other chat-based AI assistants were introduced, some have suggested that UIs are not necessary or optimal anymore. I have a slightly different take.
I believe what's really going on is that chat-based AI assistants give you a level of power and flexibility that most UIs don't offer, and because of that we often choose a chat-based interface over something else.
I think that graphical UIs are going to catch up pretty fast to become a vital companion to chat-based UIs. In fact, there's something really delightful about using a great graphical UI in combination with a chat-based AI assistant.
We've been thinking about this a lot at Buffer. To make Buffer feel awesome to use in this new world, I think a few things are vital:
- That you can do everything you can do in the web and apps via chat-based AI assistants. This means a solid API, MCP, CLI.
- That the graphical UI is truly real-time. You want to be able to work with a chat-based assistant while having the graphical UI open, and see changes happen immediately. This is going to be a big one for us to tackle at Buffer soon.
- The product is geared around flexibility, interoperability and customization. If you can personalize the UI to your needs and build on it easily, you'll feel great about it. If you can connect the product easily with other platforms, you will feel in control of your workflows.
- The ability to interact with the product in a back and forth manner towards great outcomes. This is really what chat-based AI assistants excel at, but there's no reason the general paradigm can't apply to many different user experiences.
I continue to think that great UIs are a big part of the future, even with the advent of chat-based AI assistants and voice control. I think they won't be used 100% of the time the way they have been for the past decades. Instead, they will be one of many interfaces to products. I personally still want that interface, and my expectations for the graphical user interface have actually increased.
@adityaag the real issue is that refusing to answer IS a medical decision. telling someone 'i can't help you with that' when they're describing symptoms is choosing not to inform. we just don't hold software to the same standard we hold silence.
we run multi-agent systems in production and landed on almost the exact same pattern. chronological memory is noise, topical is signal. the key insight for us was making memory writes explicit but reads semi-automatic based on the task context the agent is entering. basically a lazy-load per project, not a daily dump.
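a rough sketch of the explicit-write, lazy-read idea, assuming memory is keyed by project name. the `TopicalMemory` class and its method names are made up for illustration, and a real setup would persist to disk or a store rather than an in-process dict.

```python
from collections import defaultdict


class TopicalMemory:
    """Notes grouped by project (topic) rather than by date."""

    def __init__(self):
        self._notes = defaultdict(list)

    def write(self, project: str, note: str) -> None:
        # Explicit write: the agent has to decide a note is worth keeping.
        self._notes[project].append(note)

    def load_for_task(self, project: str) -> list:
        # Semi-automatic read: called when the agent enters a task context,
        # and it returns only that project's notes, not the full history.
        return list(self._notes.get(project, []))


if __name__ == "__main__":
    memory = TopicalMemory()
    memory.write("billing", "invoices are generated nightly")
    memory.write("search", "index rebuild takes about an hour")
    print(memory.load_for_task("billing"))
```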
@t_blom the voice interface is the easy part honestly. the real unlock is when it learns your priority model over time -- which senders you always reply to, which threads you archive without reading. that implicit triage logic is where the leverage actually lives.
@forgebitz the trap is most teams build this and it becomes another notification channel you learn to ignore. the real unlock is when the ai only surfaces things that require a decision right now, not just interesting metrics. few get the filtering right.
hot take: the gap between "ai agent demo" and "ai agent in production" is about 10x more work than people think. and 90% of that work has nothing to do with ai. it's session recovery, graceful restarts, cost controls, and making sure nothing breaks silently at 3am. the model is the easy part.
the real unlock isn't margins, it's compounding. every client engagement makes your AI tooling better, which makes the next delivery faster and cheaper. traditional agencies never had that flywheel. https://t.co/uJgPggxklE
AI-first agencies are literally becoming the new SaaS companies.
Pricing power, incredible gross margins, high RPE, more product than service. Great time to be a service provider if you're willing to get your hands dirty with AI and tooling.
ran into this exactly. spent maybe 5% of time on prompt configs and tool wiring, 95% on session state recovery, message deduplication, graceful restarts, cost controls. the harness conversation skews toward what's fun to tweet about, not what actually breaks at 3am.
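message deduplication is one of the smaller items on that list, and a minimal sketch of it might look like this, assuming every incoming message carries a stable id. the `Deduplicator` name is hypothetical, and the seen-set would need to survive restarts (a file or database, say) to help with session recovery.

```python
class Deduplicator:
    """Tracks message ids already handled so a replay is not processed twice."""

    def __init__(self):
        self._seen = set()

    def is_new(self, message_id: str) -> bool:
        # Returns True the first time an id is seen, False on every repeat.
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        return True


if __name__ == "__main__":
    dedupe = Deduplicator()
    for message_id in ["m1", "m2", "m1"]:
        print(message_id, "process" if dedupe.is_new(message_id) else "skip")
```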
the interesting production question is what happens when you stack multiple steering vectors. single concept steering is a fun demo but real enterprise use cases need 5-10 behavioral constraints simultaneously, and the vectors start interfering with each other around vector 3-4. prompting is ugly but degrades gracefully.
the syncing part is actually the easy bit. the hard part is extraction -- 90% of any chat is throwaway scaffolding. we run persistent ai agents that write to shared markdown files after each session and the filter that decides what gets persisted vs discarded matters way more than the sync mechanism itself.
the real bottleneck isn't api coverage, it's auth and rate limiting. we run 8 agents hitting various saas apis daily and the biggest friction is oauth token refresh, per-seat licensing that doesn't map to agent usage, and 429s from rate limits designed for humans clicking buttons. the api exists, the pricing and access model doesn't.
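for the 429 part, a minimal sketch of retry with backoff. assumptions: `request` is a hypothetical callable that returns `(status, headers, body)`, and `Retry-After` is given in seconds (it can also be an HTTP date, which this sketch does not handle). the `sleep` parameter is only there so the wait can be stubbed out.

```python
import time


def call_with_backoff(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a request while the server answers 429 Too Many Requests."""
    for attempt in range(max_attempts):
        status, headers, body = request()
        if status != 429:
            return status, body
        # Prefer the server's Retry-After hint; otherwise back off exponentially.
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt)
        sleep(delay)
    raise RuntimeError(f"rate limited: gave up after {max_attempts} attempts")


if __name__ == "__main__":
    # Fake responses standing in for a real SaaS API.
    responses = iter([(429, {"Retry-After": "0"}, ""), (200, {}, "ok")])
    print(call_with_backoff(lambda: next(responses)))
```

this only softens the symptom. the pricing and access model mismatch in the post is not something client-side retries can fix.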
the token spend flex is the new vanity metric. nobody asks what you shipped, just how much you burned. we run a team of ai agents and the goal is always less spend per outcome, not more.
Levels of AI psychosis:
Level one: You believe Claude's/ChatGPT's praise for you/your work is sincere and not just a ploy to keep you using the product
Level two: You measure your output in number of lines of code/number of github pushes (even though you did none of the work)
Level three: You stop caring about revenue/profit/growth and instead brag about money spent on AI tokens
Have seen level one and two for a while.
Seeing level three quite a bit more with founders posting screenshots of their Anthropic/OpenAI invoices to "brag" about how much money they are spending.
Bizarre world.
the missing piece in that comparison table is feedback loop latency. context updates in seconds, harness in hours, model in weeks. in production we found that 80% of agent quality gains came from context-layer learning alone because you can iterate 100x faster than touching the other two layers. speed of the learning loop matters more than ceiling of impact.
everyone's debating which ai model to build on. wrong question. the right question is how fast can you swap all of them. we rebuilt our entire agent stack from one provider to another in under a week. that's the only moat that actually matters right now.
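one way to keep swaps cheap is to have agent code depend on a thin interface instead of a vendor sdk. a minimal sketch, where `ChatProvider`, `ProviderA`, `ProviderB` and `run_agent_step` are all hypothetical names and the provider classes are stubs rather than real api clients:

```python
from typing import Protocol


class ChatProvider(Protocol):
    """The only surface the agent code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class ProviderA:
    # Stub standing in for one vendor's client; no real API call is made.
    def complete(self, prompt: str) -> str:
        return f"[A] {prompt}"


class ProviderB:
    # Stub standing in for a different vendor's client.
    def complete(self, prompt: str) -> str:
        return f"[B] {prompt}"


def run_agent_step(provider: ChatProvider, prompt: str) -> str:
    # Agent logic never imports a vendor SDK directly, so swapping is one argument.
    return provider.complete(prompt)


if __name__ == "__main__":
    for provider in (ProviderA(), ProviderB()):
        print(run_agent_step(provider, "summarize the ticket"))
```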
the missing one on this list is workflow lock-in. once your ai product touches 3-4 steps in someone's actual process, switching costs compound fast. we tracked it across a few client builds and retention jumps from ~30% to 80%+ when you own even two adjacent steps. that's the real wrapper moat, not the model call.
@karrisaarinen the cost of writing code is dropping to zero but the cost of reading, reviewing, and maintaining it hasn't moved at all. if anything ai makes that worse -- more code, same number of humans who have to understand it. the bottleneck was never production speed.
the pattern in software was the same 18 months ago. "ai cant write production code" until one day 30-40% of new code at google and meta was ai-generated. the shift happens when liability frameworks catch up, not the technology. medicine and law are maybe 12 months behind where coding was.