https://t.co/B4PyeEkYaP v0.8.2 is a CUDA performance release.
On Gemma 4 E4B Q8, https://t.co/B4PyeEkYaP is faster than llama.cpp across the GB10 + B200 + H100 SXM prefill/decode sweep.
We also benchmarked against vLLM in the BS=1 regime and found similar speedups.
Mean speedups:
GB10 prefill: 1.83×
B200 prefill: 2.19×
GB10 decode: 1.09×
B200 decode: 1.24×
Full technical report below.
@ivanfioravanti Hey Ivan, any chance you can add https://t.co/Yu6Ng5njJZ to this? Curious to see how it compares now! I also recently cleaned up the installer + UX so installing the new version should be quick!
Excited to share cuTile Rust: bringing Rust's fearless concurrency to GPU kernel programming. Our paper "Fearless Concurrency on the GPU" is now on arXiv.
For me the best part is how fast you can iterate: high-performance CUDA kernels developed directly in Rust.
Huge thanks to my co-authors Melih Elibol, Jared Roesch, Isaac Gelado, and Michael Garland on this project.
Warp divergence is below the safety line. A tile program is one logical thread, and the memory model orders at tile-block granularity, not per-lane. Tile IR handles warp specialization during lowering, so divergence is a perf/predication concern, not soundness. Data-race freedom rests on disjoint partition views + token ordering, and neither needs lanes to reconverge. The tradeoff is that you give up explicit SIMT control to get that.
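The "disjoint partition views" idea has a familiar CPU-side analogue in plain Rust, sketched below. This is not the cuTile Rust API: `scale_in_tiles` is a hypothetical helper that uses only the standard library, with `chunks_mut` standing in for a partition view and scoped threads standing in for tile blocks. Token ordering is not modeled here.

```rust
use std::thread;

/// Hypothetical CPU analogy, not cuTile Rust: scale every element in place,
/// handing each worker thread its own non-overlapping `&mut [f32]` view.
fn scale_in_tiles(data: &mut [f32], tile: usize, factor: f32) {
    assert!(tile > 0, "tile size must be non-zero");
    thread::scope(|s| {
        // `chunks_mut` yields disjoint mutable slices, so the borrow checker
        // can prove that no two workers ever touch the same element.
        for chunk in data.chunks_mut(tile) {
            s.spawn(move || {
                for x in chunk.iter_mut() {
                    *x *= factor;
                }
            });
        }
        // All spawned threads are joined when the scope ends.
    });
}

fn main() {
    let mut v: Vec<f32> = (0..8).map(|i| i as f32).collect();
    scale_in_tiles(&mut v, 3, 2.0);
    println!("{:?}", v);
}
```

Handing the same slice to two workers mutably would be rejected at compile time, which is the property the partition views rely on.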
For science, AI sovereignty and physics-grounded reasoning are non-negotiable. But how can we teach a small LLM like Gemma-4-E4B physics? One way is to use Agent Skills, but this has so far been limited to closed frontier models. mistral․rs now implements Agent Skills natively: the first self-hosted inference engine that does this as part of the local inference substrate, where we can use small models to solve complex scientific and other tasks in a flexible and scalable way.
We are in a period of uncertainty about frontier models - access, pricing, deprecation, abrupt restriction. The good news is that when the entire stack runs locally we can build AI that is entirely your own: You own the weights, the skills, the execution loop, the data - all of it runs on your hardware and is reproducible and durable.
While virtually all local inference engines expose a model behind an OpenAI-compatible endpoint, everything agentic is then assembled around it by an external orchestrator that injects context, manages tools, mounts files, and brokers execution. mistral․rs is natively agentic and moves that machinery into the server itself, allowing us to build complex agentic workflows and run them locally, on open-source models.
With this new feature you can now upload Agent Skills bundles to /v1/skills, reference them from Responses API requests by identity, and run them inside a native agentic loop with persistent Python sessions, figure capture, sandboxed shell execution, and file inputs mounted directly into the working session. It is plug-and-play and completely compatible with your existing code and workflow.
A model with a native skill substrate can act, observe consequences, and modify what it is able to do. The skill is a retained procedural capability of the system.
Attached is a short video of all of it: skills, code execution, and the full agentic loop carried by Gemma-4-E4B, running entirely on my MacBook Pro. You can install and run a server with this capability in two lines in your terminal, with any quantization you need.
Nice work by the @googlegemma team, @OfficialLoganK, @demishassabis, and @ericlbuehler with mistral․rs!
Fearless Concurrency on the GPU
For those interested, @melibol just posted a paper on building a safe Rust kernel programming abstraction on top of Tile IR.
https://t.co/MMPxi4oOEg
A short teaser: the safety is effectively free. On a B200, the safe GEMM is competitive with cuBLAS: about 2 PFlop/s, or 92% of the GPU's dense f16 roofline.
Read more in the paper or Melih's LinkedIn post (https://t.co/jyyfdC2Vc8)
He will also be giving a talk at RustConf in September. Hopefully he will see you there!
Congratulations @googlegemma on their launch of Gemma 4 12B!
https://t.co/Yu6Ng5mLUr has full multimodal & agentic support.
`mistralrs run --agent -m google/gemma-4-12B-it --quant 4`
Meet Gemma 4 12B!
A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.
Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇
@sakurayukiai It's a mix of many optimizations at every level, from CUDA graphs, to minor kernel fusions, to optimized attention and inference engine patterns.
Mistralrs, the fast, flexible LLM inference engine written in Rust by @ericlbuehler, looks very promising and fast, especially on the CUDA side.
I took a screenshot of the "What makes it fast" section below.
I tested it on an M5 Max using the Qwen3-0.6B bf16 model vs mlx-lm. Video below and final results...
🥇 mlx-lm 303 toks/s
🥈 mistralrs 253 toks/s
I'll keep an eye on this project because it looks really promising!
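To read the two throughput figures above as a ratio, in the same form as the speedup factors quoted for the CUDA release, a quick calculation might look like the sketch below. The `speedup` helper is a hypothetical name introduced for illustration; the inputs are the 303 and 253 toks/s numbers from this test.

```rust
/// Hypothetical helper: throughput ratio of `candidate` over `baseline`,
/// both measured in tokens per second.
fn speedup(candidate_tps: f64, baseline_tps: f64) -> f64 {
    assert!(baseline_tps > 0.0, "baseline throughput must be positive");
    candidate_tps / baseline_tps
}

fn main() {
    let mlx_lm = 303.0; // toks/s, from the M5 Max test above
    let mistralrs = 253.0; // toks/s, from the M5 Max test above

    // Prints roughly 1.20x: mlx-lm is about 20% faster in this run.
    println!("mlx-lm vs mistralrs: {:.2}x", speedup(mlx_lm, mistralrs));
    // Prints roughly 0.83x: the same gap seen from the other side.
    println!("mistralrs vs mlx-lm: {:.2}x", speedup(mistralrs, mlx_lm));
}
```

This is a single run on one small model on Apple silicon, so the ratio is not comparable to the CUDA sweep numbers.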