Eric Buehler @ericlbuehler - Twitter Profile

Pinned Tweet

19 days ago

https://t.co/B4PyeEkYaP v0.8.2 is a CUDA performance release. On Gemma 4 E4B Q8, https://t.co/B4PyeEkYaP is faster than llama.cpp across the GB10 + B200 + H100 SXM prefill/decode sweep. We also benchmarked against vLLM in the BS=1 regime and found similar speedups.
Mean speedups:
GB10 prefill: 1.83×
B200 prefill: 2.19×
GB10 decode: 1.09×
B200 decode: 1.24×
Full technical report below.

3

24

1

17

5K

Eric Buehler

@ericlbuehler

1 day ago

@ivanfioravanti Hey Ivan, any chance you can add https://t.co/Yu6Ng5njJZ to this? Curious to see how it compares now! I also recently cleaned up the installer + UX so installing the new version should be quick!

0

1

0

29

Eric Buehler

@ericlbuehler

2 days ago

@AtmoPierce Glad to hear!

0

33

Eric Buehler

@ericlbuehler

3 days ago

Excited to share cuTile Rust: bringing Rust's fearless concurrency to GPU kernel programming. Our paper "Fearless Concurrency on the GPU" is now on arXiv. For me the best part is how fast you can iterate: high-performance CUDA kernels developed directly in Rust. Huge thanks to my co-authors Melih Elibol, Jared Roesch, Isaac Gelado, and Michael Garland on this project.

4

166

18

92

9K

Eric Buehler

@ericlbuehler

3 days ago

Warp divergence is below the safety line. A tile program is one logical thread, and the memory model orders at tile-block granularity, not per-lane. Tile IR handles warp specialization during lowering, so divergence is a perf/predication concern, not soundness. Data-race freedom rests on disjoint partition views + token ordering, and neither needs lanes to reconverge. The tradeoff is that you give up explicit SIMT control to get that.

0

1

0

1

178
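The reply above rests data-race freedom on disjoint partition views. As a rough CPU-side analogy only (this is not the cuTile Rust API, and it does not model token ordering or Tile IR lowering), the sketch below uses standard-library `chunks_mut` and scoped threads: each worker receives a non-overlapping mutable view of the buffer, so the borrow checker can rule out two workers writing the same element. The function name `scale_in_tiles` and its parameters are hypothetical, chosen for illustration.

```rust
use std::thread;

/// Multiplies every element by `factor` in place.
/// Each worker thread gets its own disjoint chunk ("tile") of the buffer.
/// Hypothetical helper for illustration; not part of cuTile Rust.
pub fn scale_in_tiles(data: &mut [f32], tile: usize, factor: f32) {
    assert!(tile > 0, "tile size must be non-zero");
    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut [f32]` views, so no two
        // workers can alias the same element. The compiler enforces this;
        // there is no runtime locking.
        for chunk in data.chunks_mut(tile) {
            s.spawn(move || {
                for x in chunk.iter_mut() {
                    *x *= factor;
                }
            });
        }
        // All spawned threads are joined when the scope ends.
    });
}
```

Calling `scale_in_tiles(&mut v, 3, 2.0)` on a vector holding `0.0` through `7.0` doubles each element, with the last chunk simply being shorter than the others. Handing the same chunk to two workers would be rejected at compile time, which is the property the analogy is meant to convey.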

ericlbuehler retweeted

Charles 🎉 Frye

@charles_irl

3 days ago

CuTile-rs paper! https://t.co/5zOOYfS8Pz

3

427

51

266

17K

ericlbuehler retweeted

Markus J. Buehler

@ProfBuehlerMIT

3 days ago

For science, AI sovereignty and physics-grounded reasoning are non-negotiable. But how can we teach a small LLM like Gemma-4-E4B physics? One way is to use Agent Skills, but this has so far been limited to closed frontier models.

mistral․rs now implements Agent Skills natively: the first self-hosted inference engine that does this as part of the local inference substrate, where we can use small models to solve complex scientific and other tasks in a flexible and scalable way.

We are in a period of uncertainty about frontier models - access, pricing, deprecation, abrupt restriction. The good news is that when the entire stack runs locally we can build AI that is entirely your own: you own the weights, the skills, the execution loop, the data - all of it runs on your hardware and is reproducible and durable.

While virtually all local inference engines expose a model behind an OpenAI-compatible endpoint, everything agentic is then assembled around it by an external orchestrator that injects context, manages tools, mounts files, and brokers execution. mistral․rs is natively agentic and moves that machinery into the server itself, allowing us to build complex agentic workflows and run them locally, on open-source models.

With this new feature you can now upload Agent Skills bundles to /v1/skills, reference them from Responses API requests by identity, and run them inside a native agentic loop with persistent Python sessions, figure capture, sandboxed shell execution, and file inputs mounted directly into the working session; plug-and-play and completely compatible with your existing code/workflow.

A model with a native skill substrate can act, observe consequences, and modify what it is able to do. The skill is a retained procedural capability of the system.

Attached is a short video of all of it: skills, code execution, the full agentic loop carried by Gemma-4-E4B, running entirely on my MacBook Pro.

You can install and run a server with this capability in two lines in your terminal, with any quantization you need. Nice work by the @googlegemma team @OfficialLoganK @demishassabis and @ericlbuehler with mistral․rs!

7

104

19

94

10K

ericlbuehler retweeted

Jared Roesch

@roeschinc

3 days ago

Fearless Concurrency on the GPU. For those interested, @melibol just posted a paper on building a safe Rust kernel programming abstraction on top of Tile IR: https://t.co/MMPxi4oOEg A short teaser: the safety is effectively free. On a B200, the safe GEMM is competitive with cuBLAS: about 2 PFlop/s, 92% of the GPU's dense f16 roofline. Read more in the paper or Melih's LinkedIn post (https://t.co/jyyfdC2Vc8). He will also be giving a talk at RustConf in September; hopefully he will see you there!

1

194

30

139

8K
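As a quick sanity check on the figures quoted in the tweet above: if about 2 PFlop/s corresponds to 92% of the dense f16 roofline, the roofline implied by those two numbers can be backed out as follows. This is only arithmetic on the rounded values in the tweet, not a figure taken from the paper or from a hardware datasheet.

```latex
\[
P_{\text{roofline}} \approx \frac{P_{\text{achieved}}}{0.92}
                    \approx \frac{2\ \text{PFlop/s}}{0.92}
                    \approx 2.17\ \text{PFlop/s}
\]
```

Because "about 2 PFlop/s" is itself rounded, the implied roofline should be read as approximate.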

Eric Buehler

@ericlbuehler

3 days ago

Paper: https://t.co/FJg8nL8Q7P Code: https://t.co/JiMj8y9ZIV

0

9

2

7

557

Eric Buehler

@ericlbuehler

16 days ago

@thiezn_ @googlegemma Thank you @thiezn_, much appreciated!

0

14

Eric Buehler

@ericlbuehler

17 days ago

Congratulations @googlegemma on their launch of Gemma 4 12B! https://t.co/Yu6Ng5mLUr has full multimodal & agentic support. `mistralrs run --agent -m google/gemma-4-12B-it --quant 4`

Google Gemma

@googlegemma

17 days ago

Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license. Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇 https://t.co/gf4FZv0WZb

403

12K

2K

5K

3M

2

5

1

388

Eric Buehler

@ericlbuehler

17 days ago

GitHub: https://t.co/sekzbwuLfD

0

3

0

54

Eric Buehler

@ericlbuehler

18 days ago

@sakurayukiai It's a mix of many optimizations at every level, from CUDA graphs, to minor kernel fusions, to optimized attention and inference engine patterns.

0

23

Eric Buehler

@ericlbuehler

19 days ago

@ivanfioravanti This error should be fixed now :) https://t.co/KvZIm1N5UD

1

2

1

0

58

Eric Buehler

@ericlbuehler

19 days ago

@ivanfioravanti Thanks @ivanfioravanti! Much more work to do on the Metal side 🚀

0

2

0

62

ericlbuehler retweeted

Ivan Fioravanti ᯅ

@ivanfioravanti

19 days ago

Mistralrs, fast and flexible LLM inference written in Rust by @ericlbuehler, looks very promising and fast, especially on the CUDA side. I took a screenshot of the "What makes it fast" section below. I tested it on M5 Max using the Qwen3-0.6B bf16 model vs mlx-lm. Video below and final results... 🥇 mlx-lm 303 toks/s 🥈 mistralrs 253 toks/s I'll keep an eye on this project because it looks really promising!