Markus Nagel @mnagel87 - Twitter Profile

12 months ago

I am excited to host and present at our #CVPR2025 tutorial on Power-efficient Neural Networks Using Low-precision Data Types and Quantization. This is together with @TiRune (Meta) and Thomas Pfeil (Recogni). 📅 Thursday Jun 12, 13:00 🏠 CVPR, room 205B ℹ️ https://t.co/ROdi9S7dF0

0

5

0

1

121

mnagel87 retweeted

Andrii Skliar 🇺🇦 @avskliar

over 1 year ago

Proud to present our work on optimizing Mixture of Experts models for on-device generation speed: https://t.co/QrsTpgjwAN We introduce a cache-aware routing strategy that boosts the memory efficiency of commonly used MoEs, improving generation throughput by 2×, all without retraining. Perfect for real-world, memory-constrained devices. This is joint work with the wonderful team here at Qualcomm: @tivaro @r_lepert @BabakEht Todor Boinovski @mnagel87 @martvanbaalen and Paul Whatmough

1

7

4

0

372

Markus Nagel @mnagel87

over 1 year ago

Are you pursuing a PhD and interested in working on the efficiency of LLMs/LVMs? Then join our model efficiency team in #QualcommAIResearch for an internship! Apply below, we have openings for 2025 as well as autumn/winter 2024. https://t.co/vNrUT9wzrB

3

179

27

136

22K

Markus Nagel @mnagel87

almost 2 years ago

Interested in boosting quantized LLM performance with QAT? Check out our latest work on Low-Rank Quantization-Aware Training (LR-QAT) which can train 7B LLMs on a single consumer-grade GPU with just 24GB of memory. New work with @yell1337 and @delchia https://t.co/xFkLltbghM

0

31

8

1

2K

mnagel87 retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

almost 2 years ago

Sparse High Rank Adapters abs: https://t.co/Cyh7GRUS7S "In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept-loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving others unchanged."
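
The idea of tuning only 1-2% of the base weights can be illustrated with a minimal sketch. It assumes a fixed random mask and a plain SGD step; the helper names `make_sparse_mask` and `masked_sgd_step` are hypothetical, and the quoted abstract does not say how the trainable entries are chosen.

```python
import random

def make_sparse_mask(rows, cols, fraction, seed=0):
    """Pick a fixed random subset of weight positions to train.

    Assumption: a random mask. The quoted abstract does not specify the selection rule.
    """
    rng = random.Random(seed)
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    k = max(1, round(fraction * len(positions)))
    return set(rng.sample(positions, k))

def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Apply an SGD update to masked entries only; all other weights stay unchanged."""
    return [
        [w - lr * g if (r, c) in mask else w
         for c, (w, g) in enumerate(zip(w_row, g_row))]
        for r, (w_row, g_row) in enumerate(zip(weights, grads))
    ]

rows, cols = 20, 50
base = [[1.0] * cols for _ in range(rows)]
grads = [[0.5] * cols for _ in range(rows)]

mask = make_sparse_mask(rows, cols, fraction=0.02)  # 2% of 1000 weights
tuned = masked_sgd_step(base, grads, mask)
changed = sum(tuned[r][c] != base[r][c] for r in range(rows) for c in range(cols))
```

Because the update is written directly into the weight matrix, the tuned model has the same shape as the base model, which is consistent with the "no inference overhead" claim in the abstract.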

4

141

26

86

13K

mnagel87 retweeted

AK

@_akhaliq

over 2 years ago

Qualcomm presents GPTVQ: The Blessing of Dimensionality for LLM Quantization. In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed by using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size vs accuracy trade-offs on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2-70B model, depending on quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU we show that VQ leads to improved latency compared to using a 4-bit integer format.
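
To make "quantization dimensionality" concrete, here is a minimal sketch of plain nearest-centroid vector quantization. It is not the GPTVQ algorithm: it leaves out the Hessian-based weight updates, the EM codebook initialization and the codebook compression. The function names are hypothetical, and the weight count is assumed to be a multiple of the codebook dimension.

```python
import math

def vq_quantize(weights, codebook):
    """Map each group of `dim` consecutive weights to the index of its nearest centroid."""
    dim = len(codebook[0])
    indices = []
    for start in range(0, len(weights), dim):
        chunk = weights[start:start + dim]
        dists = [sum((w - c) ** 2 for w, c in zip(chunk, centroid))
                 for centroid in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def vq_dequantize(indices, codebook):
    """Rebuild the flat weight list by looking up each index in the codebook."""
    return [value for i in indices for value in codebook[i]]

def index_bits_per_weight(num_centroids, dim):
    """Index storage cost per weight, ignoring the cost of storing the codebook itself."""
    return math.log2(num_centroids) / dim

codebook = [(0.0, 0.0), (1.0, 1.0)]   # 2-D VQ with two centroids (toy values)
weights = [0.1, -0.1, 0.9, 1.2]
indices = vq_quantize(weights, codebook)
restored = vq_dequantize(indices, codebook)
```

Under this accounting, a 2-D codebook with 256 centroids costs `index_bits_per_weight(256, 2)` = 4 index bits per weight, the same index budget as a 4-bit scalar format, while letting the centroids follow the joint distribution of weight pairs.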

5

371

62

178

56K

Markus Nagel @mnagel87

over 2 years ago

I'm excited that our paper 'The LLM Surgeon' got accepted at #ICLR2024. This is work by @tychovdo during his internship at #QualcommAIResearch in collaboration with @martvanbaalen, @TiRune and @y_m_asano. Check out the thread below for more details.👇

Tycho van der Ouderaa

@tychovdo

over 2 years ago

⭐️New paper ⭐️ Excited to share 'The LLM Surgeon', accepted at ICLR 2024. We obtain SOTA pruning performance and even demonstrate structured LLM pruning of full rows and cols. Direct practical impact enabling compression up to 20-30% with negligible loss in performance.🧵1/9👇

3

108

12

38

19K

0

21

1

1K

mnagel87 retweeted

The TWIML AI Podcast

@twimlai

over 2 years ago

Today on the podcast, we talk with @mnagel87 from @Qualcomm AI Research about his accepted papers at NeurIPS, along with other papers and demos presented by the @QCOMResearch team. 🎧/🎥: https://t.co/lzmcFurgUm.

0

16

4

1

2K

mnagel87 retweeted

The TWIML AI Podcast

@twimlai

over 2 years ago

The approach of removing outliers, as emphasized in Markus’ Quantizable Transformers paper, is not a quantization method but rather an approach to address and eliminate the root cause of activation quantization issues. Catch @mnagel87’s episode at https://t.co/6VsXlpejTP.

0

9

2

0

590

Markus Nagel @mnagel87

over 2 years ago

If you are around and want to connect, please send me a PM. You can also find me at our poster sessions or the Qualcomm booth.
Poster sessions:
- Pruning vs Quantization: Tuesday 17:15-19:15
- Quantizable Transformers: Thursday 17:00-19:00

0

151

Markus Nagel @mnagel87

over 2 years ago

This week I’m at #NeurIPS2023 to present our recent model efficiency research:
1) Pruning vs Quantization: Which is Better?, w/ A Kuzmin, @martvanbaalen, A Behboodi, @TiRune
2) Quantizable Transformers: Removing Outliers by Helping Attention Heads do Nothing, w/ @yell1337, @TiRune

1

7

1

0

517

mnagel87 retweeted

Qualcomm Research & Technologies

@QCOMResearch

over 2 years ago

Read about our #NeurIPS2023 plans including four accepted demos on fast stable diffusion, on-device learning (ODL), fast #AI assistant, and generative relighting, as well as six accepted machine learning papers. See you soon in New Orleans! https://t.co/iI15R0UaMt

0

9

3

1

900

Markus Nagel @mnagel87

over 2 years ago

ResQ: Residual Quantization for Video Perception
Davide Abati, Haitam Ben Yahia, Markus Nagel, Amirhossein Habibian
https://t.co/PeAvBYjvF3
Friday 6th @ 10:30 AM-12:30 PM (room nord, poster 102)

0

2

0

219

Markus Nagel @mnagel87

over 2 years ago

This week I'm in Paris at #ICCV2023 to present some of our recent work on model efficiency and quantization. Please join me for our talks and posters or at the Qualcomm booth. (1/4, schedule follows)

2

10

1

2

1K

Markus Nagel @mnagel87

over 2 years ago

QBitOpt: Fast and Accurate Bitwidth Reallocation during Training
@jornpeters, @mfournarakis, Markus Nagel, @martvanbaalen, @TiRune
https://t.co/NK1xuBiaps
RCV workshop, Monday 2nd @ 15:20 (poster session, room S04)

1

0

535

Markus Nagel @mnagel87

over 2 years ago

@Tracing47202686 @yell1337 @TiRune Unlike with clipped softmax, achieving an exact zero in the output with softmax1 for a (partial) no-update requires the input to be -infinity. However, after @EvMill's blog post we experimented with softmax1 and found it competitive in practice with our proposed approaches.
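
The difference can be checked numerically with a minimal sketch. It assumes softmax1 means softmax with an extra 1 in the denominator, and that clipped softmax is clip((ζ − γ)·softmax(x) + γ, 0, 1); the ζ and γ values below are illustrative and not taken from the thread.

```python
import math

def softmax(xs):
    """Standard softmax, shifted by the max logit for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax1(xs):
    """exp(x_i) / (1 + sum_j exp(x_j)), i.e. softmax with an implicit extra zero logit."""
    m = max(max(xs), 0.0)
    exps = [math.exp(x - m) for x in xs]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]

def clipped_softmax(xs, zeta=1.0, gamma=-0.03):
    """Stretch softmax outputs to the range [gamma, zeta], then clip to [0, 1].

    Assumption: zeta and gamma are illustrative hyperparameter values.
    """
    return [min(1.0, max(0.0, (zeta - gamma) * p + gamma)) for p in softmax(xs)]

logits = [4.0, 0.0, 0.0]          # finite logits, one dominant token
probs_clipped = clipped_softmax(logits)
probs_softmax1 = softmax1(logits)
```

With these finite logits, the two small entries of `probs_clipped` are exactly zero, whereas every entry of `probs_softmax1` stays strictly positive; softmax1 only lets the total attention mass drop below 1.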

0

12

1

4K

Markus Nagel @mnagel87

over 2 years ago

I'm excited to share that our paper "Quantizable Transformers: Removing Outliers by Helping Attention Heads do Nothing" by @yell1337, @TiRune and myself has been accepted at #NeurIPS 2023! Paper: https://t.co/6kGeYdQ7Cq

2

60

6

14

4K

Markus Nagel @mnagel87

over 2 years ago

TL;DR: Transformers learn strong activation outliers, making them difficult to quantize. We study their root cause and relate outliers to a no-op and partial update behavior. Our proposed clipped softmax and gated attention avoid outliers and make transformers easily quantizable.

0

6

1

342

Markus Nagel @mnagel87

over 2 years ago

TL;DR: We compare pruning and quantization analytically and empirically for various levels, on distributions, per-layer and for full neural networks with fine-tuning. Our results show that in most cases quantization outperforms pruning.
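
A toy version of the "on distributions" comparison, as a minimal sketch under stated assumptions: weights are drawn from a standard Gaussian, the baseline is taken to be 16-bit, so 50% magnitude pruning and 8-bit uniform quantization both halve the storage, and the cost of storing the sparsity mask is ignored. None of these settings come from the tweet.

```python
import random

def prune_smallest(xs, sparsity):
    """Magnitude pruning: zero out the `sparsity` fraction of entries with smallest |x|."""
    k = int(sparsity * len(xs))
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else x for i, x in enumerate(xs)]

def quantize_uniform(xs, bits):
    """Uniform quantization onto 2**bits evenly spaced levels spanning [min, max]."""
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((x - lo) / step) * step for x in xs]

def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

mse_pruned = mse(weights, prune_smallest(weights, 0.5))   # 50% sparsity
mse_quant = mse(weights, quantize_uniform(weights, 8))    # 8-bit uniform grid
```

In this setting the quantization error is far below the pruning error, in line with the TL;DR's conclusion; the sketch says nothing about heavier-tailed distributions or fine-tuned networks.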

0

12

1

0

399

Markus Nagel @mnagel87

over 2 years ago

I'm excited to share that our paper "Pruning vs Quantization: Which is Better?" by Andrey Kuzmin, @martvanbaalen, Arash Behboodi, @TiRune and myself has been accepted at #NeurIPS 2023! Paper: https://t.co/fJXYQNQxwo

9

116

14

56

10K
