Can an LLM act as a selective model of a GPU during evolutionary search, by reasoning + forecasting a kernel’s runtime but deferring to a GPU when unsure? We produced 12k kernels + runtimes from evolutionary search, costing 400M reasoning tokens + 600 GPU-hours to answer this.
In our work GPU Forecasters, we study language models as selective surrogates for GPU kernel optimization.
1️⃣ Off-the-shelf LLMs can forecast how a GPU responds to a candidate kernel with non-trivial accuracy. If we rank candidates by these predictions and measure only the top 10% on a GPU, the fastest kernel we find is within 20% of the best in the pool.
2️⃣ We want LLMs to not just be accurate but also calibrated, so that we can use their uncertainty for selective prediction: during search, we should trust only confident forecasts and verify less confident forecasts by sending them to the GPU.
3️⃣ We train an open-weights surrogate (GPT-OSS-20B) with RL to improve both accuracy and calibration. Calibration-shaped rewards improve both confidence reliability and ranking ability, while correctness rewards alone do not.
4️⃣ Inside a real kernel search, the surrogate finds faster kernels than an equal-GPU-budget baseline by considering more candidates per measurement.
5️⃣ We release 12,388 LLM-generated GPU kernels with measured runtimes spanning 118 operations, CUDA and Triton backends, 3 GPU types, taking 400M tokens + 600 GPU-hours to produce. This dataset can be used for analyzing LLM-driven evolutionary program search dynamics, post-training LLMs for kernel code generation, and things we didn’t get a chance to explore, like training reward models!
Thread 🧵👇
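To make finding 1️⃣ concrete, here is a minimal Python sketch of the "rank by forecast, measure only the top 10%" protocol. The function names and the way "within 20% of the best" is computed (found runtime divided by the pool's best runtime, minus one) are assumptions for illustration, not the paper's exact evaluation code.

```python
import math

def best_in_top_fraction(forecasts, runtimes, fraction=0.10):
    """Rank candidates by forecast runtime (lower is faster), 'measure' only
    the top fraction, and return the best measured runtime among them."""
    n_measure = max(1, math.ceil(fraction * len(forecasts)))
    ranked = sorted(range(len(forecasts)), key=lambda i: forecasts[i])
    measured = ranked[:n_measure]  # only these would be sent to the GPU
    return min(runtimes[i] for i in measured)

def slowdown_vs_pool_best(forecasts, runtimes, fraction=0.10):
    """Relative gap between the kernel we found and the true best in the pool.
    0.0 means we recovered the best kernel; 0.2 means we are 20% slower."""
    found = best_in_top_fraction(forecasts, runtimes, fraction)
    return found / min(runtimes) - 1.0
```

For example, with forecasts `[3.0, 1.0, 2.0, 4.0]` and true runtimes `[2.0, 1.2, 1.0, 5.0]`, measuring only the single top-ranked candidate finds the 1.2 kernel, which is 20% slower than the pool's best.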
I like this! Been curious about whether LLMs can reason through how kernel edits affect space and time. Not sure if roofline analysis by LLMs is trustworthy 🤣
🚨 GPU Forecasters 👉 we explore if a reasoning model can be a selective world model of a GPU, forecasting a kernel's speed while deferring to real hardware when unsure, making kernel search more efficient.
Inside an evolutionary kernel search, the surrogate lets us explore many more candidates in imagination and run only the most promising on the GPU. We often find kernels as fast or faster using the same number of real GPU evaluations.
We also show that reinforcement learning with calibration rewards can teach the surrogate to know when it doesn't know, making it more reliable during search.
We see this as early work toward approximate world models of complex hardware-software systems!
🧵 👇
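As a rough illustration of what a calibration reward can look like, the sketch below contrasts a correctness-only reward with a Brier-style reward. The Brier score is a standard proper scoring rule; whether the paper uses this exact form is an assumption, and `correct` stands in for any binary outcome, such as "the forecast correctly said the candidate beats the reference".

```python
def correctness_reward(correct: bool) -> float:
    """Reward that ignores the stated confidence entirely."""
    return 1.0 if correct else 0.0

def brier_reward(confidence: float, correct: bool) -> float:
    """Brier-style reward: 1 minus the squared gap between the stated
    confidence and the outcome (assumed form, not the paper's)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    target = 1.0 if correct else 0.0
    return 1.0 - (confidence - target) ** 2
```

Under this reward a wrong forecast at 0.9 confidence scores 0.19, while a wrong forecast at 0.5 confidence scores 0.75, so admitting uncertainty costs less than being confidently wrong. The correctness-only reward gives both cases 0.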
Appreciate the shoutout @_akhaliq for our work on "GPU Forecasters" exploring whether language models can act as selective surrogates for GPU kernel optimization! Details in our thread: https://t.co/dVOm3IJGwm
🚨 LLMs are increasingly used to generate GPU kernels, but evaluating those kernels still requires expensive compilation and execution on real hardware.
Can LLMs act not just as kernel generators but also as forecasters of kernel performance, deferring to hardware only when uncertain?
Introducing ✨GPU Forecasters✨, our new study of LLMs as selective surrogates for GPU kernel optimization across:
• 12,388 measured kernels across 118 operations
• CUDA + Triton backends & 3 GPU types
• 400M tokens + 600 GPU-hours
We find that:
1⃣ Off-the-shelf LLMs can predict relative kernel performance surprisingly well. Measuring only the top 10% of LLM-ranked candidates recovers kernels within 20% of the best available.
2⃣ Accuracy alone isn't enough. A useful surrogate must be calibrated, i.e., knowing when to trust its forecasts and when to defer to the GPU.
3⃣ Inside a real evolutionary kernel search, the surrogate evaluates far more candidates under the same GPU budget, leading to faster kernels than an equal-budget baseline.
More results, analysis, and released data in the thread
🧵👇
Can LLMs predict GPU kernel runtimes instead of measuring them on actual hardware?
We find that:
- LLMs act as great selective surrogates (deferring to GPUs when unsure)
- RL improves LLM accuracy & calibration
- Kernel search becomes much more efficient
We're releasing 12K kernels + runtimes for the community to build on.
Great work led by Zaid! Check more details 🧵
GPU kernels are the engines powering NNs, making their optimization a key to self-improving agents. But search over kernels is expensive because eval on hardware takes time.
We train calibrated surrogate models that forecast kernel speedups w/out execution. Calibration is key here as it lets us perform selective prediction, off-loading uncertain predictions to the GPU while trusting more certain ones.
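A minimal sketch of the selective-prediction loop described above, assuming a hypothetical `forecast(candidate)` that returns a predicted speedup plus a confidence in [0, 1], and a hypothetical `measure(candidate)` that runs the kernel on the GPU. The fixed threshold of 0.8 is also an illustrative choice, not a value from the paper.

```python
def selective_eval(candidates, forecast, measure, threshold=0.8):
    """Trust confident forecasts; defer everything else to the GPU.

    Returns (results, gpu_calls), where each result is a tuple of
    (candidate, speedup, source) and source is 'forecast' or 'gpu'."""
    results = []
    gpu_calls = 0
    for cand in candidates:
        predicted, confidence = forecast(cand)
        if confidence >= threshold:
            # Confident: accept the surrogate's number without execution.
            results.append((cand, predicted, "forecast"))
        else:
            # Uncertain: pay for a real measurement.
            results.append((cand, measure(cand), "gpu"))
            gpu_calls += 1
    return results, gpu_calls
```

The better calibrated the confidence is, the more safely the threshold can be lowered, which is what turns calibration into saved GPU time.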
We see this as a first step towards building world models for hardware-software systems!
Key findings:
▪️ We find that off-the-shelf models can perform forecasting and we show how we can use calibration losses to improve them
▪️ We also show how our selective surrogate models can be incorporated into real kernel searches, leading search to converge on faster kernels under the same budget and breaking out of stagnant searches
▪️ Along the way, we built up a sizeable dataset of >12k generated kernels with their runtimes. This is an important resource for future work in this area, and opens up a lot of interesting research directions in predicting kernel performance.
Check out the 🧵 and paper for more details! 👇
Work done with @cyjustinchen @jmin__cho @EliasEskin @mohitban47 @unccs @UTCompSci @JHUCompSci! We’d also like to thank @Modal for a generous academic compute grant!
We view this as a first step towards developing world models for complex cyber-physical systems!
Paper: https://t.co/nk967IDyIn
Code: https://t.co/RPs5J8QSY1
HuggingFace Data: https://t.co/Fkn8Q4kSSP
Where does a surrogate's training data come from? It is a byproduct of running search. Every measured candidate already carries the (reference, candidate, hardware, speedup) tuple a surrogate learns from, so a long-running search produces its own training set.
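The (reference, candidate, hardware, speedup) tuple can be sketched as a small record that a search loop appends to every time it measures a candidate. The field types, the helper names, and the definition of speedup as reference runtime divided by candidate runtime are assumptions for illustration; the released dataset's actual schema may differ.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SurrogateExample:
    reference: str   # source of the reference implementation
    candidate: str   # source of the candidate kernel
    hardware: str    # GPU type label
    speedup: float   # assumed: reference runtime / candidate runtime

def log_measurement(log, reference, candidate, hardware, ref_ms, cand_ms):
    """Record one measured candidate as a training example for the surrogate."""
    example = SurrogateExample(reference, candidate, hardware, ref_ms / cand_ms)
    log.append(example)
    return example

def to_jsonl(log):
    """Serialize the accumulated examples, one JSON object per line."""
    return "\n".join(json.dumps(asdict(ex)) for ex in log)
```

Because logging happens at measurement time, the training set grows at no extra GPU cost as the search runs.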
We release 12,388 LLM-generated GPU kernels with measured runtimes, spanning 118 problems, CUDA and Triton, three GPU types, and four search methods, at a cost of 400M tokens and 600 GPU-hours. Because kernel search is computationally expensive, this dataset can be reused for analyzing LLM-driven evolutionary program search dynamics, post-training LLMs for kernel code generation, and things we didn’t get a chance to explore, like training reward models!
🚨 Excited to share SpatialUncertain — a controlled framework for evaluating whether VLMs know when not to answer spatial questions (and why).
➡️ Spatial reasoning is not just about finding the right answer—it is about knowing whether the available evidence supports an answer at all.
Visual observations can be incomplete or even misleading.
📦 Objects may be hidden by occlusion.
📐 Perspective may create misleading visual cues.
Yet today's VLMs are usually evaluated as if every question has a reliable answer. We introduce SpatialUncertain, a controlled framework for evaluating:
🔍 Can VLMs recognize when visual evidence is insufficient or unreliable?
🧭 Can they identify what additional viewpoints are needed before answering?
Thread🧵👇
I'll be at #CVPR2026, feel free to ping if you want to meet up! Will be giving 4 different keynotes at these exciting @CVPR workshops and looking forward to engaging discussions on diverse topics 🙂
(also happy to discuss hiring at all levels: PhD, postdoc, faculty)
ps. also meet several of our awesome students/postdocs who will be attending