Token leaderboards are not a great idea.
Now we're hearing about budget freezes and per-employee token limits.
Both are the same mistake.
Treating AI as one number on the budget is like treating all your salary spend as one number. Nobody does that. A staff engineer and a contractor aren't the same line, and you'd never set one headcount budget and call it strategy.
But that's exactly how most teams are handling AI right now. One bucket. One number to either brag about or panic over.
The value AI adds to shipping a feature that moves your top line is nothing like the value it adds writing cold email copy — or the value it adds when you're burning tokens recreating a tool you could have bought.
Lumping all of that together as "AI spend" is how you end up making bad decisions in both directions: tokenmaxxing to inflate the number, then slamming on the brakes when the bill scares you.
We've watched this play out. We've talked to teams blocked from spending on a tool that would solve a real problem for 1-2 orders of magnitude less than the salaries they're already paying to solve it by hand — because the "AI budget" was frozen.
The worst part is what the freeze kills: experimentation. When you cut that, you're betting the technology around you isn't changing. It is.
AI isn't a line item. It's a set of very different bets, each tied to a different part of your business. Start budgeting it that way: https://t.co/mcAvpUEm1x
Customer Case Study: @SnorkelAI hits 93% accuracy with Herald, no runbooks. Their demo of Herald won the team Snorkel's internal Engineering Excellence Award.
Snorkel AI connected four data sources to Herald and gave it nothing else. No runbooks. No historical root cause analyses. No Slack history.
Herald ingested all four in hours and returned its first correct root cause analysis within two days.
Their PM ran the agent through 90 real engineering tasks. It got 93% right.
In the first two weeks live in production, 27 engineers root caused 52 incidents.
Most #AIDevOps agents come with a setup tax. Weeks of writing runbooks, tagging incidents, and building a knowledge base before you see anything useful. Snorkel skipped all of it.
Kartik Mathur, Director of Engineering at Snorkel AI, co-authored the case study with Herald CEO @vsreekanti.
👉 Read the full case study: https://t.co/nuMpg06EHW
Last week, a VP of Engineering told us his team had already built most of what we do.
We'd spent four meetings with the engineers who actually run reliability for him. We knew where their data lived, what systems they used, how they triaged incidents. So we knew the team was working almost entirely by hand — getting paged, spinning up channels, linking tickets manually.
He thought they were further along than they were. We knew more about the state of his org than he did.
This isn't new. Leaders have always been a few steps removed from the work. Enterprises have grown for decades despite that gap. But AI makes the gap expensive in a way it never was before.
Here's why: an agent only creates value when it conforms to how your team actually works. That requires a clear picture of the current state and a real definition of what "good" looks like. If you don't know what good is, an LLM won't tell you — it'll just help you get to the wrong place faster.
The VP was anchored on a demo built on hard-coded workflows and runbooks. It looked impressive. It also wouldn't survive contact with the complexity of his actual stack — a stack he'd partly lost track of. So the risk wasn't that he'd buy nothing. It's that he'd buy something that amplified the disorganization already there.
The fix isn't more diligence from the top. It's trusting the people doing the work to make the call. They're the ones living with the pain. They know what's worth automating and what good looks like, because they're standing in the ground truth every day.
The further you sit from that ground truth, the more likely you are to believe your team is light-years ahead — or behind — where it actually is.
That's the part AI doesn't fix for you: https://t.co/tLqJczGo4O
Herald has been named to the InfraRed 100, recognizing the top private companies defining the future of cloud infrastructure.
Thank you to the @Redpoint team for the recognition.
Want to try Herald? We now offer Herald CLI: completely free, full-featured, and up and running securely from your terminal in minutes.
Join the waitlist: https://t.co/GvIS3tgBXd
See the full list of 2026 InfraRed 100 honorees and accompanying industry report: https://t.co/1l33BHqmYL
RunLLM is now Herald. There's a story behind it, and it starts with a moment every engineer knows.
It's 3am, an alert has fired, and you're looking at something you've never seen before: a novel incident. Your runbooks don't cover it. Your tools tell you something is wrong, but none can say why or how, or what to do about it.
For decades, that's been the deal. Something breaks, you fix it fast. Companies are pretty good at it. But it comes with real costs — alert fatigue, engineer burnout, and the moments when customers tell you something is down before you know it yourself.
So we asked a different question: what if you didn't have to wait for something to break? What if your systems could tell you what's about to go wrong, before alerts fire, before customers notice?
To herald something is to signal that it's about to happen. And that's the shift we're bringing to observability and reliability: from t₊₁ to t₋₁, where t is the moment something breaks.
To deliver on this promise, we're offering Herald CLI — a full-featured, completely free agent that runs securely on your laptop and gets up and running in minutes. Try it on your own stack to see how Herald moves you from always being behind problems to getting ahead of them.
👉 Sign up for Herald CLI early access here: https://t.co/abnNB62Kdy
👉 Read more about the Herald brand from our CEO @vsreekanti: https://t.co/gyE3a06xHO
Lots of exciting news to share today!
1. @RunLLM is now @Herald_Dev. The new name reflects the fact that our AI SRE is the only product on the market that operates autonomously — teaching itself about your product & infra, detecting early warning signs of incidents, and investigating without runbooks. Read more: https://t.co/7GjY5bmsoh
2. Herald was named to the InfraRed 100, an annual list recognizing the most promising private companies defining the future of cloud infrastructure. Thanks to Redpoint for the recognition!
3. We're releasing the beta of the Herald CLI — an agent that runs securely on your laptop and gets up and running in minutes. Sign up for early access here: https://t.co/fbv3byCx2L
Why do on-call engineers often ignore expensive AI tools during incidents? I wrote about what's broken and what it takes to fix it: https://t.co/YndwqKzvxC
Finding product-market fit has always been the holy grail for every startup.
In AI, it might not be the "we've made it" moment it once was.
The traditional advice once you find PMF is to operationalize. Codify the ICP. Build the playbooks. Deepen the product. The point is consistency — $N in, $M out.
In AI, consistency is a liability.
Customer preferences are being rebuilt every week. The demo they saw last night is the new benchmark. If that signal takes three weeks to travel from a sales call back to a roadmap decision, you're already behind.
The companies that win aren't going to be the ones that find PMF first.
They're going to be the ones that keep replacing their own product while the market is still figuring itself out: https://t.co/Zg9ICCbLRQ
If customers had been willing to write us $250K checks on day one, we would have built the wrong product.
With RunLLM, we set out to build the same AI SRE agent everyone else was building: an RCA agent triggered by alerts, driven by customer-maintained runbooks. It was the obvious answer. Humans use runbooks, so the agent should too.
Except alert thresholds are noisy. Nobody actually maintains their runbooks. And the agent inherits every gap.
We didn't figure that out because we were smarter than anyone else. We figured it out because the market gave us time. Enterprise SRE buyers don't move fast. They have committees. They want weeks to evaluate. They ask hard questions about what happens when something breaks at 3am.
That slowness is what put us on the right track.
In a fast market, the competitive pressure forces you to ship the obvious solution and iterate from there. You don't get time to ask whether you're solving the right problem — you just have to start solving something. In a slow market, you're forced to keep asking. And for hard problems, the obvious solution is rarely the right one.
The interesting question in AI SRE isn't "how do we automate the runbook." It's "how do we detect early warning signs, validate them, and find root cause before any threshold alert fires?" We didn't get to that question by moving fast.
We got to it because the market wouldn't let us.
I see a lot of founders right now benchmarking themselves against Cursor's growth curve and feeling like something is wrong. For most infrastructure problems worth solving, that curve was never going to apply. And the slowness you're frustrated by is probably the thing that's going to make your product impossible to copy in three years.
Friction is information. Don't optimize it away too early: https://t.co/CZR9GFmwdZ
We spent $63 on a single investigation last month.
That number stuck with me, because it's the cleanest illustration I've seen of where AI economics are actually heading.
Per-token costs are plateauing. But per-request token consumption is going up — fast. Every time we add another LLM call to pre-read data, rerank results, or evaluate relevance, the bill goes up. And we keep adding them, because that's how you actually get good answers.
The honest truth: we have a dozen more places we'd love to throw an LLM at the problem. We're held back by cost, latency, and evals — not by ideas.
Most teams are reaching for fine-tuning or RL to fix this. I'd push back. The hard part of post-training isn't the algorithm. It's having the right data in the right shape, and most teams don't.
The boring lever almost no one pulls hard enough: matching model size to task difficulty.
Gating questions, filtering documents, synthesizing logs — none of these need a frontier model. A smaller model handles them fine, at a fraction of the cost. We default to GPT-4.1 Mini for a lot of these, and it's been one of the highest-leverage decisions we've made.
There's no clean rule for when to use what. It's still more art than science. But if you're not actively making that call, you're paying for it.
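To make the routing idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the task names, the two tiers, and the per-token prices are placeholders, not real price quotes and not a description of our production setup. It only shows how per-request cost adds up across calls and how much of it depends on which tier each call lands on.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; placeholders, not real quotes.
PRICE_PER_MILLION = {"small": 0.40, "frontier": 10.00}

# Hypothetical mapping from task type to model tier.
# Easy, high-volume steps go to the small tier; open-ended reasoning does not.
TIER_FOR_TASK = {
    "gate_question": "small",
    "filter_documents": "small",
    "synthesize_logs": "small",
    "root_cause_analysis": "frontier",
}


@dataclass
class Call:
    task: str
    tokens: int


def pick_tier(task: str) -> str:
    """Return the tier for a task, defaulting to the frontier tier if unknown."""
    return TIER_FOR_TASK.get(task, "frontier")


def request_cost(calls: list[Call], route: bool = True) -> float:
    """Total USD cost of one request made up of several LLM calls."""
    total = 0.0
    for call in calls:
        tier = pick_tier(call.task) if route else "frontier"
        total += call.tokens / 1_000_000 * PRICE_PER_MILLION[tier]
    return total


calls = [
    Call("gate_question", 2_000),
    Call("filter_documents", 500_000),
    Call("synthesize_logs", 300_000),
    Call("root_cause_analysis", 200_000),
]
routed = request_cost(calls)                # small tier where the task allows it
unrouted = request_cost(calls, route=False)  # everything on the frontier tier
```

With these made-up numbers, most of the tokens sit in the easy steps, so the routed request costs a fraction of the unrouted one. The mapping itself is the judgment call: the table is easy to write and hard to get right.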
Wrote more about how we think about managing token demand here: https://t.co/YKhM60JOOM
Agents can't choose between structure and flexibility.
We learned this the hard way. In the early days of RunLLM, we built the way most AI SRE vendors still build: have customers write runbooks, encode them as workflows, let the agent execute them in response to alerts.
It worked in demos. It fell apart in production.
The moment an alert looked different from anything we'd seen before, the agent was useless. The moment a customer's architecture changed, the runbook was stale. We were shipping a glorified lookup table and calling it an agent.
The instinct is to flip the other way. Let the model figure it out. Give it good context, a capable loop, and get out of the way.
That works until you try to run it at scale. Context windows fill up and something has to decide what to keep. Costs balloon and something has to route cheaper tasks to cheaper models. Multiple agents need to coordinate and something has to orchestrate them. Each of those is an engineering decision that can't be solved by asking the model nicely.
The teams building serious agents have all landed in the same place, independently: structure where it has to be enforced, flexibility where reasoning matters, and a deliberate architecture deciding which is which.
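One of those enforced decisions, choosing what stays when the context window fills up, can be sketched as plain code rather than a prompt. This is a minimal illustration in Python under loud assumptions: the `priority` field is a hypothetical relevance score, and `count_tokens` is a crude word count standing in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    priority: int  # hypothetical relevance score; higher means keep first


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_budget(items: list[Item], budget: int) -> list[Item]:
    """Keep the highest-priority items that fit the budget, in original order."""
    ranked = sorted(range(len(items)), key=lambda i: -items[i].priority)
    kept, used = set(), 0
    for i in ranked:
        cost = count_tokens(items[i].text)
        if used + cost <= budget:
            kept.add(i)
            used += cost
    return [items[i] for i in sorted(kept)]


history = [
    Item("deploy log from last week", priority=1),
    Item("current alert payload with error details", priority=3),
    Item("service ownership notes", priority=2),
]
context = fit_to_budget(history, budget=10)
```

The budget is enforced by the code, so it holds no matter what the model would prefer. What the model gets to reason over is whatever survives the cut.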
Picking a side is how you avoid doing that work.
New post on the AI Frontier this week on why the Python vs. Markdown debate is the wrong debate: https://t.co/sbIO75lpW1
A VP of Construction Engineering. That's who our AI-powered SDR was emailing last week.
We're a developer tools company. Construction engineering is not in our ICP. But the agent saw "VP of Construction Engineering" and decided it was close enough to "VP of Engineering."
A human would catch that instantly. The agent couldn't, because it was built the way most agents are built today: take a human workflow, write down the steps, and hand each one to an LLM.
That works when everything fits the expected pattern. It falls apart the moment anything requires judgment.
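The failure is easy to reproduce. Here is a minimal sketch in Python, assuming the step boils down to a loose word match against an ICP title. The function names and the review routing are hypothetical, not a description of how the actual SDR agent was built.

```python
# Hypothetical ICP list, already normalized to lowercase.
ICP_TITLES = {"vp of engineering"}


def normalize(title: str) -> str:
    """Lowercase the title and collapse repeated whitespace."""
    return " ".join(title.lower().split())


def loose_match(title: str) -> bool:
    """Naive step: accept if every word of some ICP title appears in the title."""
    words = set(normalize(title).split())
    return any(set(icp.split()) <= words for icp in ICP_TITLES)


def triage(title: str) -> str:
    """Exact matches go out; near misses go to a human instead of being sent."""
    if normalize(title) in ICP_TITLES:
        return "send"
    if loose_match(title):
        return "human_review"
    return "skip"
```

`loose_match("VP of Construction Engineering")` returns `True`, which is the mistake described above. `triage` sends the same title to `"human_review"`, so the judgment call lands with a person rather than in an outbound email.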
I keep seeing the same mistake across the industry. Agents that try to generate a finished slide deck from a prompt. Agents that try to write and send entire email sequences autonomously. Agents that present themselves as replacements for the human rather than tools that make the human better.
The best agent products I've used don't work that way. They keep the scope narrow, the feedback loops fast, and the human in the loop on the decisions that require taste.
When the cost of generating work is zero, taste is what stands out. The agents that win are the ones designed to let humans apply it.
We wrote about this — and what we think "agent-native" actually means — in the first post of our new series: https://t.co/CKWblY0wqi
Ask Claude to build you a financial model in Excel.
You'll get back reasonable structure, plausible assumptions, formulas that link together correctly.
Now you have to check it.
Do you open every cell and inspect every formula? If you do that, you might as well have built it yourself. If you don't, you're trusting a junior employee who works at superhuman speed but might have encoded some very strange assumptions that didn't stand out at first glance.
Validating agent-generated work is the problem nobody is talking about.
Agents have made creation cheap. They haven't made it any easier to know whether what was created is actually right.
The bottleneck used to be writing the code, building the model, drafting the document. Now it's checking the output. And our tools — spreadsheets, code review, document editors — were all designed for a world where humans did the creating. None of them are built for the volume or the speed agents produce at.
@profjoeyg and I wrote about this, and what we think validation actually has to look like going forward: https://t.co/9JSKvoxT52
AI agents shouldn't have a job title.
The entire AI industry is racing to build "AI SDRs," "AI SREs," and "AI SOC analysts." You can't walk through SF without seeing a billboard for one.
We get why — customers search for these terms, and if your site doesn't speak their language, you lose the SEO battle before you make your pitch.
But here's the problem: when you name your agent after a job title, you're promising it can do everything that person does. Including the stuff that never made it into the job description.
The result is mismatched expectations, eroded trust, and products that underdeliver on their own marketing.
Meanwhile, the agent category with the deepest adoption, the strongest data flywheels, and the most widespread quality? Coding agents. And none of them call themselves an "AI software engineer."
That's not a coincidence.
The full post explains why job title thinking constrains what an agent can actually do: https://t.co/W8Sv5tQOPM