Tal Schuster @TalSchuster - Twitter Profile

Pinned Tweet

almost 4 years ago

Introducing our work @GoogleAI CALM: Confident Adaptive Language Modeling 🧘 Large Language Models don't need their full size for every generated token. We develop an Early Exit framework to significantly #accelerate decoding from #Transformers! 🔗: https://t.co/KOlXTKKoqd 🧵1/

TalSchuster retweeted

Behnam

@OrganicGPT

28 days ago

Local AI at 400 tok/s, zero loss, Gemma 4 31B 🤯 Google's MTP approach is way better than Qwen's because it SCALES with MTP parameter but Qwen peaks at MTP=3. I can now have FULL BF16 Gemma 4 31B at +400 tok/s on RTX 6000 Pro using MTP=100, video coming soon!

Tal Schuster @TalSchuster

27 days ago

@liranringel @urieli17 For example, multilingual output, temperature sampling, and serving setups without too much wasted compute (e.g. CPU/macOS or batch size > 1 per GPU). But if you have compute to spare and want to maximize speed rather than throughput, then DDTree looks like a very nice solution!

TalSchuster retweeted

Google Gemma

@googlegemma

27 days ago

Gemma 4 up to 3x faster, directly in your phone! 🚀 Check out the difference Speculative Decoding makes! Multi-Token Prediction (MTP) is supercharging inference speeds for Gemma 4.

Tal Schuster @TalSchuster

28 days ago

@zahihod @urieli17 https://t.co/sKP5pCrHsI

Omar Sanseviero

@osanseviero

29 days ago

Gemma 4 Drafters landing across the OS ecosystem ✅transformers ✅VLLM ✅MLX ✅SGLang ✅Ollama ✅AI Edge Gallery And more coming!

Tal Schuster @TalSchuster

28 days ago

@ZanyMan_e @urieli17 Context length and model size are two different things

Tal Schuster @TalSchuster

28 days ago

@liranringel @urieli17 Maybe in very specific setups, from what I know right now. Not in general. But it's great that there is progress

Tal Schuster @TalSchuster

28 days ago

@urieli17 @LiorParente Was released with day 0 support thanks to great partnership :) https://t.co/pAkjgxqqOD

Prince Canuma

@Prince_Canuma

29 days ago

Congratulations to @GoogleDeepMind on the launch of Gemma 4 Multi-Token-Prediction Drafters 🎉🚀 Happy to have partnered with them for Day-0 support on MLX. The new drafters accelerate both single and batch requests by up to 3x. Here is a graph showing how different block sizes affect performance. MLX-VLM release coming soon! PR and model collection 👇🏽

TalSchuster retweeted

Benjamin Marie

@bnjmn_marie

28 days ago

I benchmarked Google’s new MTP for Gemma 4 31B using vLLM with 4 speculative tokens, a fairly conservative setup.

Results:
- Much higher throughput than Qwen3.6’s MTP
- Lower latency too, helped by Gemma 4 generating fewer tokens
- For coding tasks with reasoning enabled, Gemma 4 is now at least 6x faster than Qwen3.6. So you can generate 5 outputs, run your tests to select the best one, and it would still be cheaper than a single output by Qwen3.6.

I’ve updated my full comparison with the new numbers: https://t.co/WU9VpQVU2Q

I also confirmed what others have reported: Gemma 4’s MTP handles a high number of speculative tokens very well. On simple text generation, I’m now testing values above 10 and reached 129 tok/s on an RTX Pro 6000, compared with 20 tok/s without MTP.

Next step: confirming how this translates to real tasks.

Tal Schuster @TalSchuster

28 days ago

@bnjmn_marie Thanks for benchmarking! Nice to see

TalSchuster retweeted

stevibe

@stevibe

28 days ago

Google dropped MTP versions of Gemma4. Ran them on my DGX Spark. The 31B dense model went from 3.94 → 8.91 tok/s. That's +126%.

Full results:
[26B A4B]
> 25.24 → 31.69 tok/s (+25.6%)
> TTFT 755 → 332ms (-56%)
[31B]
> 3.94 → 8.91 tok/s (+126%)
> TTFT 599 → 378ms (-37%)

If you're not running MTP, you're leaving free perf on the table.
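The percentages in the post above can be checked with a one-line relative-change formula. The sketch below is only an illustration of that arithmetic; the function name `percent_change` and the `runs` table are made up for this example, and the input figures are copied from the post.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the post (tok/s for throughput, ms for TTFT).
runs = {
    "26B A4B tok/s": (25.24, 31.69),
    "26B A4B TTFT ms": (755, 332),
    "31B tok/s": (3.94, 8.91),
    "31B TTFT ms": (599, 378),
}

for name, (before, after) in runs.items():
    print(f"{name}: {percent_change(before, after):+.1f}%")
```

A positive value is a gain in tok/s; a negative value for TTFT means the first token arrives sooner.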

Tal Schuster @TalSchuster

28 days ago

@zhijianliu_ Really nice to see the fast progress

Tal Schuster @TalSchuster

28 days ago

Nice work from @zhijianliu_'s lab! The native Gemma drafter gives high speedups across the board. For certain cases like low-entropy outputs (greedy decoding, structured output, etc.) and memory-bound setups (small batch size + strong device), specialized techniques like this could further boost generation!

Zhijian Liu

@zhijianliu_

29 days ago

DFlash for Gemma 4: Up to 6x Faster. ⚡⚡ Great to see MTP land natively in Gemma 4 today. If you want to push it further, try DFlash — open source, same quality, more speed!! https://t.co/wKcRoibuOB

Tal Schuster @TalSchuster

29 days ago

@urieli17 We'd be happy to hear your feedback :)

TalSchuster retweeted

AshutoshShrivastava

@ai_for_success

29 days ago

🚨 Google just made Gemma 4 up to 3x faster with MTP ⚡ Same quality, way more speed. It predicts multiple tokens at once and verifies them in parallel, removing latency bottlenecks. You can also run powerful models locally on mobile like me using Google AI Edge Gallery.
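The "predict several tokens, then verify them in parallel" idea in the post above can be sketched as a generic greedy draft-and-verify loop. This is a toy illustration under stated assumptions, not the Gemma 4 drafter implementation: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, both "models" are plain functions over integer tokens, and the verification loop stands in for what a real system would do in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical greedy next-token interface


def toy_target(ctx: List[Token]) -> Token:
    """Stand-in for the large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Stand-in for the cheap drafter: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """One greedy draft-and-verify step; returns the tokens to append."""
    # 1) The drafter proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2) The target predicts the next token at every drafted position.
    #    (Assumption: a real system batches these k + 1 calls into one pass.)
    verified = [target(context + proposed[:i]) for i in range(k + 1)]

    # 3) Accept the longest prefix where drafter and target agree, then add
    #    the target's own token at the first mismatch (or a bonus token if
    #    every drafted token was accepted).
    accepted: List[Token] = []
    for i in range(k):
        if proposed[i] != verified[i]:
            break
        accepted.append(proposed[i])
    accepted.append(verified[len(accepted)])
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], max_new: int, k: int) -> List[Token]:
    """Repeat draft-and-verify steps until max_new tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new]
```

Because every emitted token is either confirmed or produced by the target, the greedy output is identical to decoding with the target alone; the drafter only changes how many target passes are needed.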

TalSchuster retweeted

vLLM

@vllm_project

29 days ago

🚀 Day-0 MTP support for Gemma4 now available at vLLM with ready-to-use docker image! ⚡️Enjoy up to 3x faster decoding performance to supercharge your development with zero quality degradation! Check out the full vLLM recipes for Gemma 4 model series👇 https://t.co/IrCaaa6SIo

Tal Schuster @TalSchuster

29 days ago

@ttunguz Actually haven't seen it on M5 yet so that's very helpful:)

Tal Schuster @TalSchuster

29 days ago

We've just released open source MTP style drafters for Gemma 4 models ⚡ Now Gemma 4 models are even faster on your choice of hardware, without losing quality! Grateful for the fruitful collaboration between my team, Gemma team, and many collaborators to enable this release!

Omar Sanseviero

@osanseviero

29 days ago

Excited to introduce Gemma 4 Multi-Token Prediction Drafters ⚡️ Accelerated inference right in your pockets
- Up to a 3x speedup
- Same quality guarantees
- Available in your favorite open-source tools

Tal Schuster @TalSchuster

29 days ago

@ttunguz Very nice! Thanks for sharing

TalSchuster retweeted

Tomasz Tunguz

@ttunguz

29 days ago

@TalSchuster 2x speedup on Mac M5 is real.

TalSchuster retweeted

Google Gemma

@googlegemma

29 days ago

Gemma 4 just got even faster! We're releasing Multi-Token Prediction (MTP) drafters that deliver up to a 3x speedup, without any degradation in output quality or reasoning logic.

Tal Schuster @TalSchuster

29 days ago

And beautiful benchmarks from @Prince_Canuma with MLX on Apple silicon https://t.co/pAkjgxqqOD
