Joey Gonzalez

5 days ago

Token leaderboards are not a great idea. Now we're hearing about budget freezes and per-employee token limits. Both are the same mistake. Treating AI as one number on the budget is like treating all your salary spend as one number. Nobody does that. A staff engineer and a contractor aren't the same line, and you'd never set one headcount budget and call it strategy. But that's exactly how most teams are handling AI right now. One bucket. One number to either brag about or panic over. The value AI adds to shipping a feature that moves your top line is nothing like the value it adds writing cold email copy — or the value it adds when you're burning tokens recreating a tool you could have bought. Lumping all of that together as "AI spend" is how you end up making bad decisions in both directions: tokenmaxxing to inflate the number, then slamming on the brakes when the bill scares you. We've watched this play out. We've talked to teams blocked from spending on a tool that would solve a real problem for 1-2 orders of magnitude less than the salaries they're already paying to solve it by hand — because the "AI budget" was frozen. The worst part is what the freeze kills: experimentation. When you cut that, you're betting the technology around you isn't changing. It is. AI isn't a line item. It's a set of very different bets, each tied to a different part of your business. Start budgeting it that way: https://t.co/mcAvpUEm1x

Zirui "Colin" Wang @zwcolin

4 days ago

Visual language models (VLMs) are surprisingly bad at comparative visual reasoning: the detect-the-difference tasks needed in medicine and science. We just made VLMs stateful by post-training cross-attention between visual encoder layers. Our approach can be bolted onto existing frontier models.

5 days ago

👀Humans compare images by looking back and forth. Many open-weight VLMs encode each image independently, and defer comparison to the LM. We introduce SVE: Stateful Visual Encoders for Vision-Language Models, where the visual encoder itself becomes change-aware. 🌐Project: https://t.co/P1ASxE5VBE 📰Paper: https://t.co/XnPbAF3Zr2 💻Code: https://t.co/TEX5T3SLmy 1/n

profjoeyg retweeted

Simon Mo

@simon_mo_

11 days ago

Amazing work! More and more RL frameworks are using vLLM as default. @vllm_project along with @anyscalecompute and @NovaSkyAI revamped weight syncing and improved wide-ep deployment for rollout!

profjoeyg retweeted

11 days ago

Last week, a VP of Engineering told us his team had already built most of what we do. We'd spent four meetings with the engineers who actually run reliability for him. We knew where their data lived, what systems they used, how they triaged incidents. So we knew the team was working almost entirely by hand — getting paged, spinning up channels, linking tickets manually. He thought they were further along than they were. We knew more about the state of his org than he did. This isn't new. Leaders have always been a few steps removed from the work. Enterprises have grown for decades despite that gap. But AI makes the gap expensive in a way it never was before. Here's why: an agent only creates value when it conforms to how your team actually works. That requires a clear picture of the current state and a real definition of what "good" looks like. If you don't know what good is, an LLM won't tell you — it'll just help you get to the wrong place faster. The VP was anchored on a demo built on hard-coded workflows and runbooks. It looked impressive. It also wouldn't survive contact with the complexity of his actual stack — a stack he'd partly lost track of. So the risk wasn't that he'd buy nothing. It's that he'd buy something that amplified the disorganization already there. The fix isn't more diligence from the top. It's trusting the people doing the work to make the call. They're the ones living with the pain. They know what's worth automating and what good looks like, because they're standing in the ground truth every day. The further you sit from that ground truth, the more likely you are to believe your team is light-years ahead — or behind — where it actually is. That's the part AI doesn't fix for you: https://t.co/tLqJczGo4O

profjoeyg retweeted

13 days ago

Lots of exciting news to share today! 1. @RunLLM is now @Herald_Dev. The new name reflects the fact that our AI SRE is the only product on the market that operates autonomously — teaching itself about your product & infra, detecting early warning signs of incidents, and investigating without runbooks. Read more: https://t.co/7GjY5bmsoh 2. Herald was named to the InfraRed 100, an annual list recognizing the most promising private companies defining the future of cloud infrastructure. Thanks to Redpoint for the recognition! 3. We're releasing the beta of the Herald CLI — an agent that runs securely on your laptop and gets up and running in minutes. Sign up for early access here: https://t.co/fbv3byCx2L

profjoeyg retweeted

Mihran Miroyan

@mirmiroyan

13 days ago

We release Recon — a new approach to reasoning synthesis for user modeling. The key insight: post-hoc rationalization ≠ reasoning. We propose using action reconstruction as a scoring criterion for synthesized reasoning traces, yielding more causally faithful reasoning and improved downstream action prediction across user modeling tasks. Paper and project page in 🧵

profjoeyg retweeted

Qiuyang Mang

@MangQiuyang

25 days ago

Open-ended coding training data may no longer be the bottleneck: AI can scale open-ended tasks—and even outperform human-expert curation. FrontierCS team is releasing FrontierSmith: a system for synthesizing open-ended coding problems at scale. Starting from closed-ended coding tasks, FrontierSmith mutates, filters, and builds runnable optimization environments for long-horizon coding agents. In our experiments, FrontierSmith data trains stronger models than human-curated open-ended data on FrontierCS and ALE-bench. Blog: https://t.co/mhdDsBnfTQ Paper: https://t.co/4CDVvNGZZ4 Code: https://t.co/90FjTjAjnv Model: https://t.co/Mf5qalg4Ll

24 days ago

Great reviewers are the essence of a great conference and a strong academic community! I am especially excited to hear that one of my students, @tsunghan_wu, was recognized as a Gold Reviewer at @icmlconf. I hope we start recognizing great reviewers at conferences.

Patrick Wu

@tsunghan_wu

25 days ago

Got the award I wanted most - TYSM ICML 🥹 Now that AI can write code, run exps, and draft papers, I feel that researchers are entering a new era: less "I made this model training work" more "wait… is this result even real?" Verification, taste, and management are becoming the new superpowers. I’ll keep trying to be a better reviewer ;)

profjoeyg retweeted

Kusha Sareen @KushaSareen

25 days ago

Finding product-market fit has always been the holy grail for every startup. In AI, it might not be the "we've made it" moment it once was. The traditional advice once you find PMF is to operationalize. Codify the ICP. Build the playbooks. Deepen the product. The point is consistency — $N in, $M out. In AI, consistency is a liability. Customer preferences are being rebuilt every week. The demo they saw last night is the new benchmark. If that signal takes three weeks to travel from a sales call back to a roadmap decision, you're already behind. The companies that win aren't going to be the ones that find PMF first. They're going to be the ones that keep replacing their own product while the market is still figuring itself out: https://t.co/Zg9ICCbLRQ

profjoeyg retweeted

27 days ago

Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context. FST vs. RL: • 3x more sample-efficient • Higher performance ceiling • Less KL drift (better plasticity) • Continual learning: succeeds where RL stalls https://t.co/kAxyDYfbPA

profjoeyg retweeted

about 1 month ago

If customers had been willing to write us $250K checks on day one, we would have built the wrong product. With RunLLM, we set out to build the same AI SRE agent everyone else was building: an RCA agent triggered by alerts, driven by customer-maintained runbooks. It was the obvious answer. Humans use runbooks, so the agent should too. Except alert thresholds are noisy. Nobody actually maintains their runbooks. And the agent inherits every gap. We didn't figure that out because we were smarter than anyone else. We figured it out because the market gave us time. Enterprise SRE buyers don't move fast. They have committees. They want weeks to evaluate. They ask hard questions about what happens when something breaks at 3am. That slowness is what put us on the right track. In a fast market, the competitive pressure forces you to ship the obvious solution and iterate from there. You don't get time to ask whether you're solving the right problem; you just have to start solving something. In a slow market, you're forced to keep asking. And for hard problems, the obvious solution is rarely the right one. The interesting question in AI SRE isn't "how do we automate the runbook." It's "how do we detect early warning signs, validate them, and find root cause before any threshold alert fires?" We didn't get to that question by moving fast. We got to it because the market wouldn't let us. I see a lot of founders right now benchmarking themselves against Cursor's growth curve and feeling like something is wrong. For most infrastructure problems worth solving, that curve was never going to apply. And the slowness you're frustrated by is probably the thing that's going to make your product impossible to copy in three years. Friction is information. Don't optimize it away too early: https://t.co/CZR9GFmwdZ

about 1 month ago

@sarahwooders It's a shame we never actually published it in an academic conference. You know we still have 8 hours ...

about 1 month ago

@JamesAlcorn94 @simon_mo_ @istoica05 Thanks James!!

about 1 month ago

Today I’m excited to congratulate @simon_mo_ on an outstanding PhD thesis defense on his work exploring the design of Inference Serving Systems. 🎉 Simon has been working on inference systems with me for nearly a decade -- long before most people even considered inference serving a research problem worth studying. Over that time, he helped drive inference systems projects spanning Clipper, @raydistributed Serve, and now @vllm_project. Together, these systems helped define the modern inference serving stack that powers today’s AI applications. Beyond being an exceptional researcher, Simon has also been a remarkable team and community builder, especially through his leadership on vLLM and the open-source ecosystem around it. Along with my colleagues @istoica05 and @koushik77, I am excited to see Simon leading @inferact as CEO and helping shape the future of inference systems and AI infrastructure. Congratulations, Simon!

about 1 month ago

For everyone staying up late to make NeurIPS -- I get it, we have all been there (still there?). However, real impact comes from doing great research, not conferences. Some of my most influential papers were published on arXiv. So if you are feeling overwhelmed, you are not alone ... but also remember it is better to give research the time it needs to be great research and not just another publication. Back to working on NeurIPS papers. 🤷‍♂️

Amit LeVi

@AmitLeViAI

about 1 month ago

Such a great evening to start a brand new research project for NeurIPS in 3.5 days. 🧘‍♂️ Day 1: planning. Night 1: running experiments and sending the abstract. Day 2: reading results, fighting with Claude, and sending again. Night 2: sleep (optional). Day 3: opening Codex and, finally, writing the paper in parallel. Night 3: resolving the “beef” with Claude (temporary peace) and going to sleep. Day 4: final reading, last-minute fixes, submission, then some relaxation, maybe a beach walk. I’ll keep you posted on the results. This will be my only single-author paper, so I can’t hide behind other submissions if it gets rejected 😅

about 1 month ago

There is a lot of hype around continual learning, but what is it and how do we evaluate it? With our new continual learning bench we sought to answer both of these questions. We developed a new methodology for designing continual learning tasks and a growth-based learning metric to isolate continual learning. Have you experienced models (agent loops) rapidly improving on your tasks? Do you have tasks that could benefit from continual learning? Let us know.

Parth Asawa

@pgasawa

about 1 month ago

Today, we’re releasing Continual Learning Bench 1.0: the first realistic benchmark for measuring how AI systems can improve in online settings. Benchmarks today assume models are stateless. Each example is independent, and once a system finishes a task, it moves on as if nothing happened. But deployed AI systems should learn from experience. We tested 10+ frontier systems against novel, expert-validated tasks and found there’s still plenty of headroom for learning. (1/n)

about 1 month ago

Someone not bragging about a better number but instead reflecting on how we talk about things and where the field is headed. Thought leadership! We need more of this!

Hanchen Li

@lihanc02

about 1 month ago

https://t.co/WsdnWcEAhL

profjoeyg retweeted

2 months ago

AI agents shouldn't have a job title. The entire AI industry is racing to build "AI SDRs," "AI SREs," and "AI SOC analysts." You can't walk through SF without seeing a billboard for one. We get why — customers search for these terms, and if your site doesn't speak their language, you lose the SEO battle before you make your pitch. But here's the problem: when you name your agent after a job title, you're promising it can do everything that person does. Including the stuff that never made it into the job description. The result is mismatched expectations, eroded trust, and products that underdeliver on their own marketing. Meanwhile, the agent category with the deepest adoption, the strongest data flywheels, and the most widespread quality? Coding agents. And none of them called themselves an "AI software engineer." That's not a coincidence. The full post explains why job title thinking constrains what an agent can actually do: https://t.co/W8Sv5tQOPM

profjoeyg retweeted

Herald

@Herald_Dev

2 months ago

https://t.co/c84WAWnJxg

profjoeyg retweeted