Token leaderboards are not a great idea.
Now we're hearing about budget freezes and per-employee token limits.
Both are the same mistake.
Treating AI as one number on the budget is like treating all your salary spend as one number. Nobody does that. A staff engineer and a contractor aren't the same line, and you'd never set one headcount budget and call it strategy.
But that's exactly how most teams are handling AI right now. One bucket. One number to either brag about or panic over.
The value AI adds to shipping a feature that moves your top line is nothing like the value it adds writing cold email copy — or the value it adds when you're burning tokens recreating a tool you could have bought.
Lumping all of that together as "AI spend" is how you end up making bad decisions in both directions: tokenmaxxing to inflate the number, then slamming on the brakes when the bill scares you.
We've watched this play out. We've talked to teams blocked from spending on a tool that would solve a real problem for 1-2 orders of magnitude less than the salaries they're already paying to solve it by hand — because the "AI budget" was frozen.
The worst part is what the freeze kills: experimentation. When you cut that, you're betting the technology around you isn't changing. It is.
AI isn't a line item. It's a set of very different bets, each tied to a different part of your business. Start budgeting it that way: https://t.co/mcAvpUEm1x
Visual language models (VLMs) are surprisingly bad at comparative visual reasoning: the detect-the-difference type of task needed in medicine and science.
We just made VLMs stateful by post-training cross attention between visual encoder layers.
Our approach can be bolted onto existing frontier models.
👀Humans compare images by looking back and forth. Many open-weight VLMs encode each image independently, and defer comparison to the LM.
We introduce SVE: Stateful Visual Encoders for Vision-Language Models, where the visual encoder itself becomes change-aware.
🌐Project: https://t.co/P1ASxE5VBE
📰Paper: https://t.co/XnPbAF3Zr2
💻Code: https://t.co/TEX5T3SLmy
1/n
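A minimal sketch of the general idea of letting one image's tokens attend to another's, assuming plain dot-product cross-attention with no learned projections. This is a generic illustration, not the SVE implementation; the function names and toy token values are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Each token of the current image attends over the tokens of the
    previous image. Hypothetical helper: real encoders would use learned
    query/key/value projections and multiple heads."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        # Residual connection: keep the original token, add what was
        # gathered from the other image.
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out

# Toy token features for two images (made-up values).
prev_tokens = [[1.0, 0.0], [0.0, 1.0]]
curr_tokens = [[1.0, 0.0], [0.5, 0.5]]
stateful_tokens = cross_attend(curr_tokens, prev_tokens)
```

The point of the sketch is only that the comparison happens inside the encoder, before any language model sees the tokens.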
Amazing work! More and more RL frameworks are using vLLM as default. @vllm_project along with @anyscalecompute and @NovaSkyAI revamped weight syncing and improved wide-ep deployment for rollout!
Last week, a VP of Engineering told us his team had already built most of what we do.
We'd spent four meetings with the engineers who actually run reliability for him. We knew where their data lived, what systems they used, how they triaged incidents. So we knew the team was working almost entirely by hand — getting paged, spinning up channels, linking tickets manually.
He thought they were further along than they were. We knew more about the state of his org than he did.
This isn't new. Leaders have always been a few steps removed from the work. Enterprises have grown for decades despite that gap. But AI makes the gap expensive in a way it never was before.
Here's why: an agent only creates value when it conforms to how your team actually works. That requires a clear picture of the current state and a real definition of what "good" looks like. If you don't know what good is, an LLM won't tell you — it'll just help you get to the wrong place faster.
The VP was anchored on a demo built on hard-coded workflows and runbooks. It looked impressive. It also wouldn't survive contact with the complexity of his actual stack — a stack he'd partly lost track of. So the risk wasn't that he'd buy nothing. It's that he'd buy something that amplified the disorganization already there.
The fix isn't more diligence from the top. It's trusting the people doing the work to make the call. They're the ones living with the pain. They know what's worth automating and what good looks like, because they're standing in the ground truth every day.
The further you sit from that ground truth, the more likely you are to believe your team is light-years ahead — or behind — where it actually is.
That's the part AI doesn't fix for you: https://t.co/tLqJczGo4O
Lots of exciting news to share today!
1. @RunLLM is now @Herald_Dev. The new name reflects the fact that our AI SRE is the only product on the market that operates autonomously — teaching itself about your product & infra, detecting early warning signs of incidents, and investigating without runbooks. Read more: https://t.co/7GjY5bmsoh
2. Herald was named to the InfraRed 100, an annual list recognizing the most promising private companies defining the future of cloud infrastructure. Thanks to Redpoint for the recognition!
3. We're releasing the beta of the Herald CLI — an agent that runs securely on your laptop and gets up and running in minutes. Sign up for early access here: https://t.co/fbv3byCx2L
We release Recon — a new approach to reasoning synthesis for user modeling.
The key insight: post-hoc rationalization ≠ reasoning.
We propose using action reconstruction as a scoring criterion for synthesized reasoning traces, yielding more causally faithful reasoning and improved downstream action prediction across user modeling tasks.
Paper and project page in 🧵
Open-ended coding training data may no longer be the bottleneck: AI can scale open-ended tasks—and even outperform human-expert curation.
The FrontierCS team is releasing FrontierSmith: a system for synthesizing open-ended coding problems at scale. Starting from closed-ended coding tasks, FrontierSmith mutates, filters, and builds runnable optimization environments for long-horizon coding agents. In our experiments, FrontierSmith data trains stronger models than human-curated open-ended data on FrontierCS and ALE-bench.
Blog: https://t.co/mhdDsBnfTQ
Paper: https://t.co/4CDVvNGZZ4
Code: https://t.co/90FjTjAjnv
Model: https://t.co/Mf5qalg4Ll
Great reviewers are the essence of a great conference and a strong academic community!
I am especially excited to hear that one of my students, @tsunghan_wu, was recognized as a Gold Reviewer at @icmlconf.
I hope we start recognizing great reviewers at conferences.
Got the award I wanted most - TYSM ICML 🥹
Now that AI can write code, run exps, and draft papers, I feel that researchers are entering a new era:
less "I made this model training work"
more "wait… is this result even real?"
Verification, taste, and management are becoming the new superpowers.
I’ll keep trying to be a better reviewer ;)
Finding product-market fit has always been the holy grail for every startup.
In AI, it might not be the "we've made it" moment it once was.
The traditional advice once you find PMF is to operationalize. Codify the ICP. Build the playbooks. Deepen the product. The point is consistency — $N in, $M out.
In AI, consistency is a liability.
Customer preferences are being rebuilt every week. The demo they saw last night is the new benchmark. If that signal takes three weeks to travel from a sales call back to a roadmap decision, you're already behind.
The companies that win aren't going to be the ones that find PMF first.
They're going to be the ones that keep replacing their own product while the market is still figuring itself out: https://t.co/Zg9ICCbLRQ
Can LLMs adapt continually without losing base skills?
Fast-Slow Training (FST) pairs "slow" weights with "fast" context.
FST vs. RL:
• 3x more sample-efficient
• Higher performance ceiling
• Less KL drift (better plasticity)
• Continual learning: succeeds where RL stalls
If customers had been willing to write us $250K checks on day one, we would have built the wrong product.
With RunLLM, we set out to build the same AI SRE agent everyone else was building: an RCA agent triggered by alerts, driven by customer-maintained runbooks. It was the obvious answer. Humans use runbooks, so the agent should too.
Except alert thresholds are noisy. Nobody actually maintains their runbooks. And the agent inherits every gap.
We didn't figure that out because we were smarter than anyone else. We figured it out because the market gave us time. Enterprise SRE buyers don't move fast. They have committees. They want weeks to evaluate. They ask hard questions about what happens when something breaks at 3am.
That slowness is what put us on the right track.
In a fast market, the competitive pressure forces you to ship the obvious solution and iterate from there. You don't get time to ask whether you're solving the right problem — you just have to start solving something. In a slow market, you're forced to keep asking. And for hard problems, the obvious solution is rarely the right one.
The interesting question in AI SRE isn't "how do we automate the runbook." It's "how do we detect early warning signs, validate them, and find root cause before any threshold alert fires?" We didn't get to that question by moving fast.
We got to it because the market wouldn't let us.
I see a lot of founders right now benchmarking themselves against Cursor's growth curve and feeling like something is wrong. For most infrastructure problems worth solving, that curve was never going to apply. And the slowness you're frustrated by is probably the thing that's going to make your product impossible to copy in three years.
Friction is information. Don't optimize it away too early: https://t.co/CZR9GFmwdZ
Today I’m excited to congratulate @simon_mo_ on an outstanding PhD thesis defense on his work exploring the design of Inference Serving Systems. 🎉
Simon has been working on inference systems with me for nearly a decade -- long before most people even considered inference serving a research problem worth studying.
Over that time, he helped drive inference systems projects spanning Clipper, @raydistributed Serve, and now @vllm_project. Together, these systems helped define the modern inference serving stack that powers today’s AI applications.
Beyond being an exceptional researcher, Simon has also been a remarkable team and community builder, especially through his leadership on vLLM and the open-source ecosystem around it.
Along with my colleagues @istoica05 and @koushik77, I am excited to see Simon leading @inferact as CEO and helping shape the future of inference systems and AI infrastructure.
Congratulations, Simon!
For everyone staying up late to make NeurIPS -- I get it, we have all been there (still there?). However, real impact comes from doing great research, not conferences. Some of my most influential papers were published on arXiv.
So if you are feeling overwhelmed, you are not alone ... but also remember it is better to give research the time it needs to be great research and not just another publication.
Back to working on NeurIPS papers. 🤷♂️
Such a great evening to start a brand-new research project for NeurIPS in 3.5 days. 🧘♂️
Day 1: planning.
Night 1: running experiments and sending the abstract.
Day 2: reading results, fighting with Claude, and sending again.
Night 2: sleep (optional).
Day 3: opening Codex and, finally, writing the paper in parallel.
Night 3: resolving the “beef” with Claude (temporary peace) and going to sleep.
Day 4: final reading, last-minute fixes, submission, then some relaxation, maybe a beach walk.
I’ll keep you posted on the results.
This will be my only single-author paper, so I can’t hide behind other submissions if it gets rejected 😅
There is a lot of hype around continual learning, but what is it and how do we evaluate it?
With our new continual learning bench we sought to answer both of these questions. We developed a new methodology for designing continual learning tasks and a growth-based learning metric to isolate continual learning.
Have you experienced models (agent loops) rapidly improving on your tasks? Do you have tasks that could benefit from continual learning? Let us know.
Today, we’re releasing Continual Learning Bench 1.0: the first realistic benchmark for measuring how AI systems can improve in online settings.
Benchmarks today assume models are stateless. Each example is independent, and once a system finishes a task, it moves on as if nothing happened.
But deployed AI systems should learn from experience. We tested 10+ frontier systems against novel, expert-validated tasks and found there’s still plenty of headroom for learning. (1/n)
Someone not bragging about a better number but instead reflecting on how we talk about things and where the field is headed. Thought leadership!
We need more of this!
AI agents shouldn't have a job title.
The entire AI industry is racing to build "AI SDRs," "AI SREs," and "AI SOC analysts." You can't walk through SF without seeing a billboard for one.
We get why — customers search for these terms, and if your site doesn't speak their language, you lose the SEO battle before you make your pitch.
But here's the problem: when you name your agent after a job title, you're promising it can do everything that person does. Including the stuff that never made it into the job description.
The result is mismatched expectations, eroded trust, and products that underdeliver on their own marketing.
Meanwhile, the agent category with the deepest adoption, the strongest data flywheels, and the most widespread quality? Coding agents. And none of them called themselves an "AI software engineer."
That's not a coincidence.
The full post explains why job title thinking constrains what an agent can actually do: https://t.co/W8Sv5tQOPM
We spent $63 on a single investigation last month.
That number stuck with me, because it's the cleanest illustration I've seen of where AI economics are actually heading.
Per-token costs are plateauing. But per-request token consumption is going up — fast. Every time we add another LLM call to pre-read data, rerank results, or evaluate relevance, the bill goes up. And we keep adding them, because that's how you actually get good answers.
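To make the arithmetic concrete, here is a rough sketch of how per-request cost grows as calls are added, even when per-token prices stay flat. Every number in it is a hypothetical assumption (the prices, the token counts, the call sizes); none of them are our real figures.

```python
from dataclasses import dataclass

@dataclass
class LLMCall:
    name: str
    input_tokens: int
    output_tokens: int

# Hypothetical prices in USD per million tokens (assumed, not real quotes).
PRICE_IN_PER_M = 2.00
PRICE_OUT_PER_M = 8.00

def call_cost(call: LLMCall) -> float:
    # Cost of one call = input tokens * input price + output tokens * output price.
    return (call.input_tokens * PRICE_IN_PER_M
            + call.output_tokens * PRICE_OUT_PER_M) / 1_000_000

def request_cost(calls: list[LLMCall]) -> float:
    # A single user request fans out into several LLM calls; costs add up.
    return sum(call_cost(c) for c in calls)

# One call that just answers the question (made-up token counts).
baseline = [LLMCall("answer", 20_000, 1_000)]

# The same request after adding the extra steps described above.
pipeline = baseline + [
    LLMCall("pre-read data", 50_000, 2_000),
    LLMCall("rerank results", 30_000, 500),
    LLMCall("evaluate relevance", 10_000, 200),
]

growth = request_cost(pipeline) / request_cost(baseline)
```

With these made-up inputs the price per token never changes, yet the request costs several times more. An agent that loops through steps like these many times per investigation compounds the effect.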
The honest truth: we have a dozen more places we'd love to throw an LLM at the problem. We're held back by cost, latency, and evals — not by ideas.
Most teams are reaching for fine-tuning or RL to fix this. I'd push back. The hard part of post-training isn't the algorithm. It's having the right data in the right shape, and most teams don't.
The boring lever almost no one pulls hard enough: matching model size to task difficulty.
Gating questions, filtering documents, synthesizing logs — none of these need a frontier model. A smaller model handles them fine, at a fraction of the cost. We default to GPT-4.1 Mini for a lot of these, and it's been one of the highest-leverage decisions we've made.
There's no clean rule for when to use what. It's still more art than science. But if you're not actively making that call, you're paying for it.
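One way to make that call explicit is a small routing table, sketched below. The task labels, the `pick_model` helper, and the `frontier-model` placeholder are hypothetical; the small-model string assumes GPT-4.1 Mini's identifier in whatever client you use. Which tasks land in the lightweight set is exactly the judgment call described above.

```python
# Assumed identifier for GPT-4.1 Mini; check your provider's model list.
SMALL_MODEL = "gpt-4.1-mini"
# Hypothetical placeholder for whichever frontier model you use.
FRONTIER_MODEL = "frontier-model"

# Hypothetical task labels for work that does not need a frontier model:
# gating questions, filtering documents, synthesizing logs.
LIGHTWEIGHT_TASKS = {"gate_question", "filter_documents", "synthesize_logs"}

def pick_model(task_type: str) -> str:
    """Route easy, high-volume tasks to the small model and everything
    else to the frontier model."""
    return SMALL_MODEL if task_type in LIGHTWEIGHT_TASKS else FRONTIER_MODEL

# Usage: decide per call instead of defaulting everything to the largest model.
model_for_filtering = pick_model("filter_documents")
model_for_rca = pick_model("root_cause_analysis")
```

Keeping the table in one place makes the choice reviewable: moving a task between tiers is a one-line change that can be checked against evals.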
Wrote more about how we think about managing token demand here: https://t.co/YKhM60JOOM