Introducing our work @GoogleAI CALM: Confident Adaptive Language Modeling 🧘
Large Language Models don't need their full size for every generated token. We develop an Early Exit framework to significantly #accelerate decoding from #Transformers!
🔗: https://t.co/KOlXTKKoqd
🧵1/
Local AI at 400 tok/s, zero loss, Gemma 4 31B 🤯 Google's MTP approach is way better than Qwen's because it SCALES with MTP parameter but Qwen peaks at MTP=3. I can now have FULL BF16 Gemma 4 31B at +400 tok/s on RTX 6000 Pro using MTP=100, video coming soon!
@liranringel@urieli17 For example multilingual output and temp sampling and serving setup without too much wasted compute (e.g. CPU/ macos or batch size > 1 per GPU).
But if you have compute to give and want to maximize speed but not throughout, then DDTree looks like a very nice solution!
Gemma 4 up to 3x faster, directly in your phone! 🚀
Check out the difference Speculative Decoding makes! Multi-Token Prediction (MTP) is supercharging inference speeds for Gemma 4.
Congratulations to @GoogleDeepMind on the launch of Gemma 4 Multi-Token-Prediction Drafters 🎉🚀
Happy to have partnered with them for Day-0 support on MLX
The new drafters accelerate both single and batch requests by upto 3x.
Here is a graph showing how different block sizes affect performance.
MLX-VLM release coming soon!
PR and model collection 👇🏽
I benchmarked Google’s new MTP for Gemma 4 31B using vLLM with 4 speculative tokens, a fairly conservative setup.
Results:
- Much higher throughput than Qwen3.6’s MTP
- Lower latency too, helped by Gemma 4 generating fewer tokens
- For coding tasks with reasoning enabled, Gemma 4 is now at least 6x faster than Qwen3.6. So you can generate 5 outputs, run your tests to select the best one, and it would still be cheaper than a single output by Qwen3.6.
I’ve updated my full comparison with the new numbers:
https://t.co/WU9VpQVU2Q
I also confirmed what others have reported: Gemma 4’s MTP handles a high number of speculative tokens very well.
On simple text generation, I’m now testing values above 10 and reached 129 tok/s on an RTX Pro 6000, compared with 20 tok/s without MTP.
Next step: confirming how this translates to real tasks.
Google dropped MTP versions of Gemma4. Ran them on my DGX Spark.
The 31B dense model went from 3.94 → 8.91 tok/s. That's +126%.
Full results:
[26B A4B]
> 25.24 → 31.69 tok/s (+25.6%)
> TTFT 755 → 332ms (-56%)
[31B]
> 3.94 → 8.91 tok/s (+126%)
> TTFT 599 → 378ms (-37%)
If you're not running MTP, you're leaving free perf on the table.
Nice work from @zhijianliu_'s lab!
Native Gemma drafter gives high speedups across the board. For certain cases like low entropy outputs (greedy decoding, structured etc.) and memory bound stup (small bsz+strong device), specialized techniques like this could further boost gen!
DFlash for Gemma 4: Up to 6x Faster. ⚡⚡
Great to see MTP land natively in Gemma 4 today. If you want to push it further, try DFlash — open source, same quality, more speed!!
https://t.co/wKcRoibuOB
🚨 Google just made Gemma 4 up to 3x faster with MTP ⚡
Same quality, way more speed. It predicts multiple tokens at once and verifies them in parallel, removing latency bottlenecks.
You can also run powerful models locally on mobile like me using Google AI Edge Gallery.
🚀 Day-0 MTP support for Gemma4 now available at vLLM with ready-to-use docker image!
⚡️Enjoy up to 3x faster decoding performance to supercharge your development with zero quality degradation!
Check out the full vLLM recipes for Gemma 4 model series👇
https://t.co/IrCaaa6SIo
We've just released open source MTP style drafters for Gemma 4 models ⚡
Now Gemma 4 models are even faster on your choice of hardware, without losing quality!
Grateful for the fruitful collaboration between my team, Gemma team, and many collaborators to enable this release!
Excited to introduce Gemma 4 Multi-Token Prediction Drafters⚡️Accelerated inference right in your pockets
- Up to a 3x speedup
- Same quality guarantees
- Available in your favorite open-source tools
Gemma 4 just got even faster!
We're releasing Multi-Token Prediction (MTP) drafters that deliver up to a 3x speedup, without any degradation in output quality or reasoning logic.
Congratulations to @GoogleDeepMind on the launch of Gemma 4 Multi-Token-Prediction Drafters 🎉🚀
Happy to have partnered with them for Day-0 support on MLX
The new drafters accelerate both single and batch requests by upto 3x.
Here is a graph showing how different block sizes affect performance.
MLX-VLM release coming soon!
PR and model collection 👇🏽