I am excited to host and present at our #CVPR2025 tutorial on Power-efficient Neural Networks Using Low-precision Data Types and Quantization. This is together with @TiRune (Meta) and Thomas Pfeil (Recogni).
📅 Thursday Jun 12, 13:00
🏠 CVPR, room 205B
ℹ️ https://t.co/ROdi9S7dF0
Proud to present our work on optimizing Mixture of Experts models for on-device generation speed: https://t.co/QrsTpgjwAN
We introduce a cache-aware routing strategy that boosts the memory efficiency of commonly used MoEs, improving generation throughput by 2×, all without retraining. Perfect for real-world, memory-constrained devices.
This is joint work with the wonderful team here at Qualcomm: @tivaro, @r_lepert, @BabakEht, Todor Boinovski, @mnagel87, @martvanbaalen, and Paul Whatmough.
Are you pursuing a PhD and interested in working on the efficiency of LLMs/LVMs? Then join our model efficiency team in #QualcommAIResearch for an internship!
Apply below, we have openings for 2025 as well as autumn/winter 2024.
https://t.co/vNrUT9wzrB
Interested in boosting quantized LLM performance with QAT?
Check out our latest work on Low-Rank Quantization-Aware Training (LR-QAT), which can train 7B LLMs on a single consumer-grade GPU with just 24 GB of memory.
New work with @yell1337 and @delchia
https://t.co/xFkLltbghM
Sparse High Rank Adapters
abs: https://t.co/Cyh7GRUS7S
"In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept-loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving others unchanged."
Qualcomm presents GPTVQ
The Blessing of Dimensionality for LLM Quantization
We show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed by using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size vs accuracy trade-offs on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2-70B model, depending on quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU we show that VQ leads to improved latency compared to using a 4-bit integer format.
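For readers new to vector quantization, here is a minimal sketch of the basic idea only: weights are grouped into d-dimensional vectors and each group is stored as the index of its nearest codebook entry. This is not the GPTVQ algorithm (no Hessian-based weight updates, no EM initialization, no codebook compression). The function names, the toy codebook, and the toy weights are all assumptions made up for illustration.

```python
import math

def vq_quantize(weights, codebook):
    """Map each consecutive d-dim group of weights to its nearest codeword index."""
    d = len(codebook[0])
    assert len(weights) % d == 0, "weight count must be a multiple of the VQ dimension"
    indices = []
    for start in range(0, len(weights), d):
        vec = weights[start:start + d]
        # Squared Euclidean distance from this group to every codeword.
        dists = [sum((v - c) ** 2 for v, c in zip(vec, cw)) for cw in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def vq_dequantize(indices, codebook):
    """Rebuild the flat weight list by looking up each index in the codebook."""
    out = []
    for i in indices:
        out.extend(codebook[i])
    return out

def index_bits_per_weight(num_codewords, dim):
    """Index storage cost per weight, ignoring the cost of the codebook itself."""
    return math.log2(num_codewords) / dim

# Hypothetical 2D codebook with 4 entries and a toy weight vector.
codebook = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
weights = [0.9, 1.2, -0.8, 0.7, 1.1, -1.3]

indices = vq_quantize(weights, codebook)
restored = vq_dequantize(indices, codebook)
```

With this accounting, a 2D codebook of 256 entries costs 8 index bits per pair of weights, i.e. the same 4 bits per weight as a 4-bit integer format, while letting the codewords sit anywhere in the 2D plane rather than on a fixed grid.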
⭐️New paper ⭐️ Excited to share 'The LLM Surgeon', accepted at ICLR 2024. We obtain SOTA pruning performance and even demonstrate structured LLM pruning of full rows and cols. Direct practical impact enabling compression up to 20-30% with negligible loss in performance.🧵1/9👇
Today on the podcast we feature our conversation with @mnagel87 from @Qualcomm AI Research to discuss his accepted papers at NeurIPS along with other papers and demos presented by the @QCOMResearch team.
🎧/🎥: https://t.co/lzmcFurgUm.
The approach of removing outliers, as emphasized in Markus’ Quantizable Transformers paper, is not a quantization method but rather an approach to address and eliminate the root cause of activation quantization issues.
Catch @mnagel87’s episode at https://t.co/6VsXlpejTP.
If you are around and want to connect, please send me a PM. You can also find me at our poster sessions or the Qualcomm booth. Poster sessions:
- Pruning vs Quantization: Tuesday 17:15-19:15
- Quantizable Transformers: Thursday 17:00-19:00
This week I’m at #NeurIPS2023 to present our recent model efficiency research:
1) Pruning vs Quantization: Which is Better?, w/ A Kuzmin, @martvanbaalen, A Behboodi, @TiRune
2) Quantizable Transformers: Removing Outliers by Helping Attention Heads do Nothing, w/ @yell1337, @TiRune
Read about our #NeurIPS2023 plans including four accepted demos on fast stable diffusion, on-device learning (ODL), fast #AI assistant, and generative relighting, as well as six accepted machine learning papers. See you soon in New Orleans! https://t.co/iI15R0UaMt
This week I'm in Paris at #ICCV2023 to present some of our recent work on model efficiency and quantization. Please join me for our talks and posters or at the Qualcomm booth.
(1/4, schedule follows)
QBitOpt: Fast and Accurate Bitwidth Reallocation during Training
@jornpeters, @mfournarakis, Markus Nagel, @martvanbaalen, @TiRune
https://t.co/NK1xuBiaps
RCV workshop, Monday 2nd @ 15:20 (poster session, room S04)
@Tracing47202686 @yell1337 @TiRune Unlike with clipped softmax, to achieve an exact zero in the output using softmax1 for a (partial) no-update, the input needs to be -infinity. However, after @EvMill's blog post we experimented with softmax1 and found it competitive in practice with our proposed approaches.
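A small sketch of that point, assuming the usual definition softmax1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)). The helper names are made up for illustration and this is not code from the paper or the blog post.

```python
import math

def softmax(xs):
    """Standard softmax; the outputs always sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax1(xs):
    """Softmax with an extra implicit zero logit in the denominator."""
    # Shift by max(max(xs), 0) for numerical stability; the implicit
    # zero logit then contributes exp(0 - m) to the denominator.
    m = max(max(xs), 0.0)
    exps = [math.exp(x - m) for x in xs]
    denom = math.exp(-m) + sum(exps)
    return [e / denom for e in exps]

# Very negative but finite logits: outputs get tiny, yet stay above zero.
low = softmax1([-20.0, -20.0])

# Only a -infinity logit yields an exact zero.
exact = softmax1([float("-inf"), 0.0])
```

So softmax1 lets a head attend to (almost) nothing, since the outputs can sum to far less than 1, but an exact zero still requires an infinite input.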
I'm excited to share that our paper "Quantizable Transformers: Removing Outliers by Helping Attention Heads do Nothing" by @yell1337, @TiRune and myself has been accepted at #NeurIPS 2023!
Paper: https://t.co/6kGeYdQ7Cq
TL;DR: Transformers learn strong activation outliers, making them difficult to quantize. We study their root cause and relate outliers to a no-op and partial update behavior. Our proposed clipped softmax and gated attention avoid outliers and make transformers easily quantizable.
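A minimal sketch of clipped softmax, assuming the form clip((ζ − γ) · softmax(x) + γ, 0, 1) with ζ ≥ 1 and γ ≤ 0. The function names and the parameter values below are picked purely for illustration, not recommended settings.

```python
import math

def softmax(xs):
    """Standard softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def clipped_softmax(xs, zeta=1.0, gamma=-0.03):
    """Stretch softmax outputs to [gamma, zeta], then clip back to [0, 1]."""
    scale = zeta - gamma
    return [min(1.0, max(0.0, scale * p + gamma)) for p in softmax(xs)]

# Moderate, finite logits are enough to get exact zeros for small probabilities.
probs = clipped_softmax([0.0, 0.0, 10.0])

# With zeta > 1 (illustrative value) the largest probability can hit exactly 1.
saturated = clipped_softmax([0.0, 0.0, 10.0], zeta=1.03, gamma=0.0)
```

Because exact zeros and ones are reachable with finite logits, the network does not have to push activations to extreme values to implement a no-op.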
TL;DR: We compare pruning and quantization analytically and empirically at various levels: on distributions, per layer, and for full neural networks with fine-tuning. Our results show that in most cases quantization outperforms pruning.
I'm excited to share that our paper "Pruning vs Quantization: Which is Better?" by Andrey Kuzmin, @martvanbaalen, Arash Behboodi, @TiRune and myself has been accepted at #NeurIPS 2023!
Paper: https://t.co/fJXYQNQxwo