Zaid Khan @codezakh - Twitter Profile

Pinned Tweet

2 days ago

Can an LLM act as a selective model of a GPU during evolutionary search, by reasoning + forecasting a kernel’s runtime but deferring to a GPU when unsure? We produced 12k kernels + runtimes from evolutionary search, costing 400M reasoning tokens + 600 GPU-hours to answer this.

In our work GPU Forecasters, we study language models as selective surrogates for GPU kernel optimization.

1️⃣ Off-the-shelf LLMs can forecast how a GPU responds to a candidate kernel with non-trivial accuracy. If we rank candidates by these predictions and measure only the top 10% on a GPU, the fastest kernel we find is within 20% of the best in the pool.

2️⃣ We want LLMs to not just be accurate but also calibrated, so that we can use their uncertainty for selective prediction: during search, we should trust only confident forecasts and verify less confident forecasts by sending them to the GPU.

3️⃣ We train an open-weights surrogate (GPT-OSS-20B) with RL to improve both accuracy and calibration. Calibration-shaped rewards improve both confidence reliability and ranking ability, while correctness rewards alone do not.

4️⃣ Inside a real kernel search, the surrogate finds faster kernels than an equal-GPU-budget baseline by considering more candidates per measurement.

5️⃣ We release 12,388 LLM-generated GPU kernels with measured runtimes spanning 118 operations, CUDA and Triton backends, 3 GPU types, taking 400M tokens + 600 GPU-hours to produce. This dataset can be used for analyzing LLM-driven evolutionary program search dynamics, post-training LLMs for kernel code generation, and things we didn’t get a chance to explore, like training reward models!

Thread 🧵👇
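Editor's sketch of the ranking step in point 1️⃣ (not the authors' code): rank a pool of candidates by forecast runtime, measure only the top fraction on the GPU, and compare the best measured kernel with the best in the pool. The function name `triage_top_fraction`, the `measure` callback, and the toy runtimes are all made up for illustration.

```python
import math

def triage_top_fraction(predicted_ms, measure, fraction=0.10):
    """Rank candidates by predicted runtime and measure only the best fraction."""
    # Number of candidates actually sent to the GPU (at least one).
    k = max(1, math.ceil(fraction * len(predicted_ms)))
    # Lower predicted runtime ranks first.
    ranked = sorted(range(len(predicted_ms)), key=lambda i: predicted_ms[i])
    # Only the top-k candidates are measured; `measure` stands in for a GPU run.
    measured = {i: measure(i) for i in ranked[:k]}
    best = min(measured, key=measured.get)
    return best, measured[best], k

# Toy pool: "true" runtimes play the role of GPU measurements (invented numbers).
true_ms = [5.0, 3.2, 4.1, 2.0, 6.3, 2.4, 7.7, 3.9, 4.4, 5.8]
pred_ms = [4.6, 3.0, 4.5, 2.6, 6.0, 2.2, 7.0, 4.2, 4.0, 6.1]

best, best_ms, k = triage_top_fraction(pred_ms, lambda i: true_ms[i], fraction=0.2)
# Relative gap between the kernel we found and the best kernel in the pool.
gap = best_ms / min(true_ms) - 1.0
```

The forecasts only need to order candidates well enough that a fast kernel lands in the measured slice; their absolute values can be off.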

codezakh retweeted

Justin T Chiu

@justintchiu

2 days ago

I like this! Been curious about whether LLMs can reason through how kernel edits affect space and time. Not sure if roofline analysis by LLMs is trustworthy 🤣

codezakh retweeted

Mohit Bansal

@mohitban47

2 days ago

🚨 GPU Forecasters 👉 we explore if a reasoning model can be a selective world model of a GPU, forecasting a kernel's speed while deferring to real hardware when unsure, making kernel search more efficient. Inside an evolutionary kernel search, the surrogate lets us explore many more candidates in imagination and run only the most promising on the GPU. We often find kernels as fast or faster using the same number of real GPU evaluations. We also show that reinforcement learning with calibration rewards can teach the surrogate to know when it doesn't know, making it more reliable during search. We see this as early work toward approximate world models of complex hardware-software systems! 🧵 👇
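Editor's sketch of the "defer when unsure" rule described above, under stated assumptions: each forecast carries a self-reported confidence in [0, 1], and a fixed threshold decides whether to trust it. The `Forecast` class, `selective_evaluate`, the 0.8 threshold, and all numbers are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Forecast:
    kernel_id: str
    predicted_speedup: float  # surrogate's guess relative to a reference kernel
    confidence: float         # surrogate's self-reported confidence in [0, 1]

def selective_evaluate(
    forecasts: List[Forecast],
    run_on_gpu: Callable[[str], float],
    threshold: float = 0.8,   # arbitrary cutoff chosen for illustration
) -> List[Tuple[str, float, str]]:
    """Trust confident forecasts; send the rest to the GPU."""
    results = []
    for f in forecasts:
        if f.confidence >= threshold:
            # Confident: accept the forecast without spending GPU time.
            results.append((f.kernel_id, f.predicted_speedup, "surrogate"))
        else:
            # Unsure: defer to a real measurement.
            results.append((f.kernel_id, run_on_gpu(f.kernel_id), "gpu"))
    return results

# Toy stand-in for real hardware measurements (invented numbers).
gpu_speedups = {"k0": 1.4, "k1": 0.9, "k2": 2.1}
forecasts = [
    Forecast("k0", 1.5, 0.95),
    Forecast("k1", 1.2, 0.40),
    Forecast("k2", 1.8, 0.85),
]
results = selective_evaluate(forecasts, gpu_speedups.__getitem__)
gpu_calls = sum(1 for _, _, source in results if source == "gpu")
```

A rule like this only saves GPU time safely if the confidences are reliable, which is why calibration matters for the surrogate.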

Zaid Khan

@codezakh

2 days ago

Appreciate the shoutout @_akhaliq for our work on "GPU Forecasters" exploring whether language models can act as selective surrogates for GPU kernel optimization! Details in our thread: https://t.co/dVOm3IJGwm

AK

@_akhaliq

3 days ago

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

codezakh retweeted

Justin Chih-Yao Chen

@cyjustinchen

2 days ago

🚨LLMs are increasingly used to generate GPU kernels, but evaluating those kernels still requires expensive compilation and execution on real hardware. Can LLMs act not just as kernel generators but also as forecasters of kernel performance, deferring to hardware only when uncertain?

Introducing ✨GPU Forecasters✨, our new study of LLMs as selective surrogates for GPU kernel optimization across:
• 12,388 measured kernels across 118 operations
• CUDA + Triton backends & 3 GPU types
• 400M tokens + 600 GPU-hours

We find that:

1⃣ Off-the-shelf LLMs can predict relative kernel performance surprisingly well. Measuring only the top 10% of LLM-ranked candidates recovers kernels within 20% of the best available.

2⃣ Accuracy alone isn't enough. A useful surrogate must be calibrated, i.e., it must know when to trust its forecasts and when to defer to the GPU.

3⃣ Inside a real evolutionary kernel search, the surrogate evaluates far more candidates under the same GPU budget, leading to faster kernels than an equal-budget baseline.

More results, analysis, and released data in the thread 🧵👇

codezakh retweeted

Jaemin Cho

@jmin__cho

2 days ago

Can LLMs predict GPU kernel runtimes instead of measuring them on actual hardware?

We find that:
- LLMs act as great selective surrogates (deferring to GPUs when unsure)
- RL improves LLM accuracy & calibration
- Kernel search becomes much more efficient

We're releasing 12K kernels + runtimes for the community to build on.

Great work led by Zaid! Check more details 🧵

codezakh retweeted

Elias Stengel-Eskin

@EliasEskin

2 days ago

GPU kernels are the engines powering NNs, making their optimization a key to self-improving agents. But search over kernels is expensive because eval on hardware takes time.

We train calibrated surrogate models that forecast kernel speedups w/out execution. Calibration is key here as it lets us perform selective prediction, off-loading uncertain predictions to the GPU while trusting more certain ones. We see this as a first step towards building world models for hardware-software systems!

Key findings:
▪️ We find that off-the-shelf models can perform forecasting and we show how we can use calibration losses to improve them
▪️ We also show how our selective surrogate models can be incorporated into real kernel searches, leading search to converge on faster kernels under the same budget and breaking out of stagnant searches
▪️ Along the way, we built up a sizeable dataset of >12k generated kernels with their runtimes. This is an important resource for future work in this area, and opens up a lot of interesting research directions in predicting kernel performance.

Check out the 🧵 and paper for more details! 👇
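Editor's sketch of one common way to quantify calibration, expected calibration error (ECE) with equal-width confidence bins. This is a generic metric, not necessarily the one used in the paper; treating a forecast as "correct" when, say, its predicted faster-or-slower direction matches the measurement is an assumption made here for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        # Each bin contributes in proportion to how many forecasts it holds.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece

# Confidence matches outcomes exactly: no calibration gap.
well_calibrated = expected_calibration_error(
    [1.0, 1.0, 0.0, 0.0], [True, True, False, False]
)
# Always 90% confident but right only a quarter of the time: large gap.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
```

A low ECE means a stated confidence can be read roughly as a hit rate, which is the property selective prediction relies on when choosing what to offload to the GPU.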

Zaid Khan

@codezakh

2 days ago

Work done with @cyjustinchen @jmin__cho @EliasEskin @mohitban47 @unccs @UTCompSci @JHUCompSci! We’d also like to thank @Modal for a generous academic compute grant!

We view this as a first step towards developing world models for complex cyber-physical systems!

Paper: https://t.co/nk967IDyIn
Code: https://t.co/RPs5J8QSY1
HuggingFace Data: https://t.co/Fkn8Q4kSSP

Zaid Khan

@codezakh

2 days ago

Where does a surrogate's training data come from? It is a byproduct of running search. Every measured candidate already carries the (reference, candidate, hardware, speedup) tuple a surrogate learns from, so a long-running search produces its own training set. We release 12,388 LLM-generated GPU kernels with measured runtimes, spanning 118 problems, CUDA and Triton, three GPU types, and four search methods, at a cost of 400M tokens and 600 GPU-hours. Kernel search is computationally expensive. This dataset can be re-used for analyzing LLM-driven evolutionary program search dynamics, post-training LLMs for kernel code generation, and things we didn’t get a chance to explore, like training reward models!
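Editor's sketch of how a search loop could log the (reference, candidate, hardware, speedup) tuple mentioned above. The class and field names, the file names, and the convention that speedup is reference runtime divided by candidate runtime are assumptions for illustration; the released dataset's actual schema may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class MeasuredCandidate:
    reference_src: str   # source of the reference kernel
    candidate_src: str   # source of the candidate kernel
    hardware: str        # GPU type the measurement was taken on
    speedup: float       # assumed convention: > 1 means faster than reference

def record_measurement(reference_src, candidate_src, hardware,
                       reference_ms, candidate_ms):
    """Turn one GPU measurement into a training example for a surrogate."""
    return MeasuredCandidate(
        reference_src, candidate_src, hardware, reference_ms / candidate_ms
    )

# Every measurement made during search is appended to the training set.
training_set: List[MeasuredCandidate] = []
training_set.append(
    record_measurement("ref.cu", "cand_a.cu", "gpu-type-1", 4.0, 2.0)
)
training_set.append(
    record_measurement("ref.cu", "cand_b.cu", "gpu-type-1", 4.0, 5.0)
)
```

Slower candidates are logged too: a surrogate needs negative examples as much as it needs wins.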

codezakh retweeted

Yue Zhang

@zhan1624

4 days ago

🚨 Excited to share SpatialUncertain, a controlled framework for evaluating whether VLMs know when not to answer spatial questions (and why).

➡️ Spatial reasoning is not just about finding the right answer; it is about knowing whether the available evidence supports an answer at all.

Visual observations can be incomplete or even misleading.
📦 Objects may be hidden by occlusion.
📐 Perspective may create misleading visual cues.

Yet today's VLMs are usually evaluated as if every question has a reliable answer. We introduce SpatialUncertain, a controlled framework for evaluating:
🔍 Can VLMs recognize when visual evidence is insufficient or unreliable?
🧭 Can they identify what additional viewpoints are needed before answering?

Thread🧵👇

codezakh retweeted

Mohit Bansal

@mohitban47

4 days ago

I'll be at #CVPR2026, feel free to ping if you want to meet up! Will be giving 4 different keynotes at these exciting @CVPR workshops and looking forward to engaging discussions on diverse topics 🙂

(also happy to discuss hiring at all levels: PhD, postdoc, faculty)

ps. also meet several of our awesome students/postdocs who will be attending
