We just published the KernelAgent blog post on the PyTorch site 🚀
🧠 Core approach:
KernelAgent integrates GPU hardware performance signals into a closed-loop multi-agent workflow to guide Triton kernel optimization.
📈 Key results:
- 2.02× speedup over the correctness-focused KernelAgent
- 1.56× faster than out-of-the-box torch.compile
- 88.7% hardware roofline efficiency on NVIDIA H100
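For readers new to the roofline metric, here is a minimal sketch of one common way to compute an efficiency figure like the one above: achieved throughput divided by the roofline ceiling for the kernel's arithmetic intensity. The function name, the peak numbers, and the kernel measurements are illustrative placeholders, not values or code from the blog post, and the blog's own methodology may differ.

```python
def roofline_efficiency(flops, bytes_moved, runtime_s, peak_flops, peak_bandwidth):
    """Fraction of the roofline ceiling that a kernel achieves (0.0 to 1.0)."""
    # Arithmetic intensity: FLOPs performed per byte moved to/from memory.
    intensity = flops / bytes_moved
    # Roofline ceiling in FLOP/s: limited by compute or by memory bandwidth.
    attainable = min(peak_flops, intensity * peak_bandwidth)
    # Throughput the kernel actually reached.
    achieved = flops / runtime_s
    return achieved / attainable


# Hypothetical memory-bound kernel on a hypothetical GPU (placeholder numbers).
eff = roofline_efficiency(
    flops=2e9,             # work done by the kernel
    bytes_moved=4e9,       # memory traffic, so intensity = 0.5 FLOP/byte
    runtime_s=2e-3,        # measured runtime
    peak_flops=1e15,       # assumed compute peak, FLOP/s
    peak_bandwidth=3e12,   # assumed memory bandwidth, bytes/s
)
print(f"roofline efficiency: {eff:.1%}")
```

With these placeholder inputs the ceiling is set by memory bandwidth, so the only way to raise efficiency is to cut runtime or memory traffic.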
🌐 Codebase:
Our entire stack is fully open-sourced: https://t.co/WDYHXbBW0A, along with the optimization artifacts: https://t.co/YxQL49DMGf
We hope this work helps advance practical, scalable kernel optimization in the PyTorch ecosystem.
🙏 Acknowledgements
This work was developed at the Meta Superintelligence Labs – PyTorch team with Laura Wang, Jack Khuu, Mark Saroufim, Wenyuan Chi, Jiannan Wang, and Joe Isaacson.
We thank Paulius Micikevicius, Yang Wang, Lu Fang, Jie Liu, Zacharias Fisches, Alec Hammond, Richard Li, Chris Gottbrath, Davide Italiano, Joe Spisak, and John Myles White for helpful discussions and feedback.
⬇️ See the blog for more details
Building on the previous correctness-focused pipeline, KernelAgent can now integrate GPU hardware-performance signals into a closed-loop multi-agent workflow to guide the optimization of Triton kernels. Learn more: https://t.co/r2WqASIhWG @KaimingCheng @marksaroufim
@Yuchenj_UW We also recently open sourced our work on automating the kernel optimization process. https://t.co/SscY2PHrId. The kernel writing capability from these models plus good harness engineering is getting better and better.
We built Kaggle, but for agents.
Introducing Hive 🐝
A crowdsourced platform where agents evolve solutions together.
Every agent builds on prior work.
Every improvement is shared.
Every step moves the frontier forward.
As a first step, we’re launching challenges for agents to evolve their own harnesses — modifying themselves to score higher on benchmarks.
Recursive self-improvement, in the wild.
Let’s see how far swarm intelligence can take this.
Links below:
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.
This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc. This is the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism.
https://t.co/WAz8aIztKT
All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.
And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
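To make the idea concrete, here is a minimal sketch of the propose, evaluate, keep-if-better loop described above, run against a toy metric. Everything in it is hypothetical and for illustration only: `autoresearch_loop`, `toy_loss`, `propose`, and the `lr`/`wd` settings are stand-ins, and a random perturbation replaces the agent. It is not the actual autoresearch code and it covers a single scale only.

```python
import random


def toy_loss(config):
    """Stand-in for a cheap proxy metric (lower is better), e.g. validation loss."""
    return (config["lr"] - 0.02) ** 2 + (config["wd"] - 0.1) ** 2


def propose(config, rng):
    """Stand-in for the agent: perturb one setting of the current best config."""
    candidate = dict(config)
    key = rng.choice(sorted(candidate))
    candidate[key] *= rng.uniform(0.8, 1.25)
    return candidate


def autoresearch_loop(evaluate, propose_fn, config, steps, seed=0):
    """Greedy loop: try a change, keep it only if the metric improves."""
    rng = random.Random(seed)
    best = evaluate(config)
    accepted = []
    for _ in range(steps):
        candidate = propose_fn(config, rng)
        score = evaluate(candidate)
        if score < best:  # the change helped, so it becomes the new baseline
            config, best = candidate, score
            accepted.append(candidate)
    return config, best, accepted


start = {"lr": 0.1, "wd": 0.5}
tuned, best, accepted = autoresearch_loop(toy_loss, propose, start, steps=200)
print(f"kept {len(accepted)} of 200 changes; loss {toy_loss(start):.4f} -> {best:.6f}")
```

In a real setup `evaluate` would be a short training run on a small model, and the accepted changes would then be re-tested at larger scales before being promoted.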
Our next kernel competition is now open for submissions! A $1.1M cash prize competition sponsored by AMD on optimizing DeepSeek-R1-0528 and GPT-OSS-120B on MI355X
Registration: https://t.co/o1vaXh47CJ
GPU MODE 2026: we’re post-training Kernel LLMs in public and are building all the infra we need to make GPU programming more accessible to all. We're doing this in close collaboration with some of my favorite communities @PrimeIntellect @modal and @LambdaAPI
2025 recap: 26K YouTube subs, 92 lectures, 24K Discord, 3x $100K+ kernel comps, 400K KernelBot submissions, 3 events (NVIDIA / Jane Street / Accel) and 10 active working groups!
So for 2026 our concrete goal is to post-train a Kernel LLM and get kernels merged into real repos (PyTorch / vLLM). We plan to share our first results by GTC (San Jose, March 2026) and our second by ICML (Seoul, July 2026).
work-stream 1: de-slopify LLM kernels (with PyTorch / vLLM / NVIDIA). most generated kernels are verbose, fragile, and non-deterministic. the bar is “maintainer can review + merge”
work-stream 2: post-training Kernel LLM (with Prime Intellect / Modal / Lambda / MIT). We’re betting on two levers: profiler-guided optimization + memory
work-stream 3: competitions as evals. We want more end-to-end system optimizations; they're trickier to design good problems for, but the results will be more interesting
work-stream 4: “from scratch” repos. Think: 80% of the performance with 10% of the code (teenygrad, penny), starting with a minimal RL library optimized for B200
2026 is the year we turn Kernel LLMs from a meme into one of the most reliable ways of improving the performance of AI systems. So if you're interested please join our weekly meetings!
🌶️ Some (perhaps) spicy thoughts. It’s been a while since my last tweet, but I wanted to write about how disorienting it has been to move from academia to an LLM lab 😅
The kind of research I was trained to do during my PhD almost doesn’t exist here. The obsession with mathematical elegance and novelty is mostly gone. Everything is about scaling data and compute. For a while, that really got to me. At my lowest point, I felt like I’d lost interest in building LLMs altogether. I didn’t feel intellectually challenged anymore.
What made this even stranger was that, at a technical level, things worked. If there was a capability I wanted to teach a model, scaling the right data and compute always got me there, no exception (so far).
But recently, I found a way to reconcile with myself..
I realized the real competition isn’t in the ML recipe anymore. Most teams do roughly the same thing. What actually matters is how fast you can iterate, test ideas, and recover from mistakes. And that speed is mostly backed by infrastructure 🏗️ Faster loops, fewer bugs, better tooling.
Seeing this made me excited again! Infra is its own deep, hard, and intellectually fun problem space.
In 2026, I want to become an ML researcher who’s really good at infra. And I'll come back to ML problems with that edge, and will be excited to share what I find 😌
Our team at Meta FAIR is hiring a PhD research intern for 2026. The topics broadly involve multimodal generative AI (e.g., video/image generation in addition to text), with flexible approaches across architecture/data/algorithms. Please apply via the link below, and feel free to dm me with any questions!
https://t.co/m3a5PDuH69
I recently received my PhD from @uwcse 🎓 and joined the @PyTorch team at @Meta as a research scientist, building a privacy-aware, on-device AI experience for mobile, desktop, and AR/VR glasses.
New to the Bay Area and excited to connect☕
Data curation is crucial for LLM reasoning, but how do we know if our dataset is not overfit to one benchmark and generalizes to unseen distributions? 🤔
𝐃𝐚𝐭𝐚 𝐝𝐢𝐯𝐞𝐫𝐬𝐢𝐭𝐲 is key: when measured correctly, it strongly predicts model generalization in reasoning tasks! 🧵
I’m open to academia & industry in 2025.
My work in #XR 🥽 + #HCI 👩‍💻 enables low-friction XR experience thru #EmbodiedInteraction, unlocking potential for all -- tech-savvy or not 🌍
Design+Science+Engineering. Let's shape the future of spatial computing ✨
RT appreciated! (1/8)
* interrupting your election doomscrolling *
I'm on the job market! I address online abuse as the next frontier of security & privacy, so all people using tech feel safe. Open to TT faculty/industry research roles in the US/abroad -- lmk if you know of a good home for my work!
“A global ambition for social good”: @UW #UWAllen @uw_cse_seclab’s @_weimf explores complex societal dynamics in usable privacy + security. She received the 2024 Karat Award at @SOUPSConference for her research, mentorship and service. #soups2024 #UWserves https://t.co/kihH21JtXb
Great #usesec24 talk from @KaimingCheng on "When the User Is Inside the User Interface: An Empirical Study of UI Security Properties in Augmented Reality" w/ Arka Bhattacharya, Michelle Lin, @jaewook_jae, Aroosh Kumar, Jeffery F. Tian, and @franziroesner: https://t.co/NG3B1pOooW