Willie Neiswanger @willieneis - Twitter Profile

Willie Neiswanger

@willieneis

4 months ago

@haozhangml Congrats Hao!!

0

1

0

129

willieneis retweeted

𝚐𝔪𝟾𝚡𝚡𝟾

@gm8xx8

8 months ago

Tina proved that LoRA can match or surpass full-parameter RL. Tora builds directly on that result, turning it into a full framework. Built on torchtune, it extends RL post-training to LoRA, QLoRA, DoRA, and QDoRA under one interface with GRPO, FSDP, and compile support. QLoRA and QDoRA enable 4-bit RL with stable rewards, while DoRA-Cache speeds rollouts by 2–4× under the same setup. Tora establishes a clean, scalable baseline for LoRA in RL post-training. ⮕ 𝐥𝐢𝐧𝐤 𝐛𝐞𝐥𝐨𝐰

3

299

28

214

30K

willieneis retweeted

Johnny Tian-Zheng Wei @johntzwei

7 months ago

Announcing 🔭✨Hubble, a suite of open-source LLMs to advance the study of memorization! Pretrained models up to 8B params, with controlled insertion of texts (e.g., book passages, biographies, test sets, and more!) designed to emulate key memorization risks 🧵 https://t.co/07K2A2uIbv

2

131

41

52

50K

Willie Neiswanger

@willieneis

8 months ago

Links — Repo: https://t.co/IDYScsaJf9 Tina Paper: https://t.co/v3nkvrx84J

0

5

0

1

515

Willie Neiswanger

@willieneis

8 months ago

It was great to see @thinkymachines LoRA w/o Regret blog, which connects nicely to our work on Tina (LoRA for RL). For wider use, we’re releasing a clean implementation of RL with LoRA, DoRA, QLoRA/QDoRA, plus speedups & more, across models from 1.5B–32B. Nice work @UpupWang!

Shangshang Wang @UpupWang

8 months ago

We now know that LoRA can match full-parameter RL training (from https://t.co/pGxoMLFIGv and our Tina paper https://t.co/dkXdxV3eNj), but what about DoRA, QLoRA, and more? We are releasing a clean LoRA-for-RL repo to explore them all. https://t.co/AsWWG1kmKt https://t.co/8CbOfZuEZw

13

560

69

403

67K

2

23

1

6

4K

Willie Neiswanger

@willieneis

10 months ago

@shengjia_zhao Awesome, congrats Shengjia!!

0

4

0

420

willieneis retweeted

Shangshang Wang @UpupWang

12 months ago

Sparse autoencoders (SAEs) can be used to elicit strong reasoning abilities with remarkable efficiency. Using only 1 hour of training at $2 cost without any reasoning traces, we find a way to train 1.5B models via SAEs to score 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23. https://t.co/4E69IU3m1K
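
The tweet above reports Pass@1 scores without saying how the metric is computed. As a rough illustration only, the sketch below uses the commonly used unbiased pass@k estimator, which for k = 1 reduces to the fraction of sampled answers that are correct, averaged over problems. The evaluation protocol actually used for the reported numbers is not given in the feed, and the function names and sample data here are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled answers, c: number of correct answers,
    k: number of attempts allowed. Equals 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k wrong answers exist, so any k samples include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[list[bool]], k: int) -> float:
    """Average pass@k over problems; each inner list marks samples as correct or not."""
    scores = [pass_at_k(len(r), sum(r), k) for r in results]
    return sum(scores) / len(scores)


# Made-up grading results for three problems with four samples each (illustration only).
sample_results = [
    [True, False, False, False],
    [True, True, True, True],
    [False, False, False, False],
]
print(f"pass@1 = {benchmark_pass_at_k(sample_results, k=1):.4f}")
```

With a single sample per problem, this is simply the share of problems answered correctly.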

10

497

55

493

72K

willieneis retweeted

Deqing Fu

@DeqingFu

about 1 year ago

Textual steering vectors can improve visual understanding in multimodal LLMs! You can extract steering vectors via any interpretability toolkit you like -- SAEs, MeanShift, Probes -- and apply them to image or text tokens (or both) of Multimodal LLMs. And They Steer!

1

53

14

10K

willieneis retweeted

Sebastian Raschka

@rasbt

about 1 year ago

Is LoRA (Low-Rank Adaptation) relevant in 2025 for reasoning models?

I recently read "Tina: Tiny Reasoning Models via LoRA (https://t.co/rIlj7amWd4)", and it made me pause for a moment: when was the last time I heard someone excitedly talk/write about LoRA?

LoRA (Low-Rank Adaptation) was one of the most influential fine-tuning methods in the earlier LLM boom (as you may remember, I wrote about it a lot in recent years). The idea is simple but effective: avoid full model updates and instead inject a small number of trainable parameters for downstream tasks. This drastically reduces memory and compute costs. But in the age of ever-larger instruction-tuned models coupled with well-working distillation techniques (as popularized by DeepSeek-R1 etc.), LoRA seemed to have become less relevant recently.

Does LoRA work for developing reasoning models?

This paper tackles exactly that question. Instead of the usual supervised fine-tuning or instruction distillation pipeline, the authors use LoRA with reinforcement learning (RL) to improve reasoning capabilities. Specifically, they fine-tune a 1.5B base model using LoRA adapters while applying RL on reasoning benchmarks.

Their baseline model is DeepSeek-R1-Distill-Qwen-1.5B, which is a model already fine-tuned for reasoning tasks. (I wish they started with the base Qwen-1.5B model; but this way, I guess they have more comparisons with other methods that further trained the DeepSeek-R1-Distill-Qwen-1.5B.)

From there, the authors ran experiments across datasets, learning rates, LoRA ranks, and RL algorithms. Their best-performing model was trained on just 7k examples and cost just $9 to train. Even with hyperparameter sweeps and multiple ablations, the entire study cost just $526.

So, how well does LoRA work?

The top half of the results figure (highlighted in blue) compares models trained with LoRA-based RL versus standard RL (i.e., no LoRA). On every benchmark (AIME24, AIME25, AMC23, MATH500, GPQA, Minerva), LoRA outperforms the regular RL baseline when applied to the same starting model.

Insights from ablations

1) Surprisingly, the best-performing model came from the smallest dataset: just 7k examples from Open-RS.

2) The classic LoRA rank 16 emerged as the sweet spot, but ranks 8 and 32 also worked well.

3) It's nice that they included the recent Dr. GRPO (I recently discussed it in my latest Ahead of AI blog). It substantially reduces training time by length-normalizing rewards and addressing issues in GRPO.

Bottom line:

Reasoning is certainly an interesting use case, and it's interesting (and a bit surprising) that LoRA does so well here. It might also be the first case where I've seen LoRA coupled with RL, which is another interesting aspect.

LoRA certainly peaked in popularity 1-2 years ago, and more people now consider (more expensive) full-parameter updates (based on anecdotal perception), but there's still a place for LoRA and LoRA-like methods.

Let's not forget that one of the key advantages of LoRA is that it doesn't modify the underlying base model. This is key in applications where you either have lots of specialized use cases or lots of customers. For example, instead of storing 100 1B full-parameter tuned models, it would be much cheaper to store a 32B model with 100 sets of LoRA weights.
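
To make the parameter and storage argument above concrete, here is a minimal sketch. It counts the trainable parameters a rank-16 LoRA adapter adds to one square weight matrix, then compares stored parameter counts for the two setups in the example. The layer width (2048) and the adapter size (100M parameters per set) are assumptions chosen for illustration, not figures from the thread or the Tina paper, and the function names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter for a (d_out x d_in) weight matrix.

    The base weight W stays frozen; only the low-rank factors are trained:
    A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


def stored_params_full(n_variants: int, params_per_model: int) -> int:
    """Total parameters stored when every variant is a fully tuned model."""
    return n_variants * params_per_model


def stored_params_lora(base_params: int, n_variants: int, adapter_params: int) -> int:
    """Total parameters stored for one shared base model plus one adapter per variant."""
    return base_params + n_variants * adapter_params


# Assumed layer width, for illustration only.
D_MODEL = 2048
per_layer_full = D_MODEL * D_MODEL
per_layer_lora = lora_trainable_params(D_MODEL, D_MODEL, rank=16)
adapter_fraction = per_layer_lora / per_layer_full

# Assumed adapter size per variant; real sizes depend on rank and target layers.
ADAPTER_PARAMS = 100_000_000
full_total = stored_params_full(100, 1_000_000_000)
lora_total = stored_params_lora(32_000_000_000, 100, ADAPTER_PARAMS)

print(f"rank-16 adapter: {per_layer_lora} params vs {per_layer_full} ({adapter_fraction:.2%})")
print(f"100 full 1B models: {full_total:,} params")
print(f"one 32B base + 100 adapters: {lora_total:,} params")
```

Under these assumptions the shared-base setup stores fewer parameters in total even though the base model is far larger, because each additional variant costs only an adapter.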

26

974

171

784

63K

willieneis retweeted

Ollie Liu

@olliezliu

about 1 year ago

Presenting our spotlight paper on LLMs for decision making at @iclr_conf, Apr 25, 10–12:30PM, Hall 3 #113. Come say hi!

0

23

5

1

4K

willieneis retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

about 1 year ago

Tina: Tiny Reasoning Models via LoRA "the best Tina model achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA."

13

750

136

589

61K

willieneis retweeted

Shangshang Wang @UpupWang

about 1 year ago

😋 Want strong LLM reasoning without breaking the bank? We explored just how cost-effectively RL can enhance reasoning using LoRA! [1/9] Introducing Tina: A family of tiny reasoning models with strong performance at low cost, providing an accessible testbed for RL reasoning. 🧵 https://t.co/zciQLTR0tV

2

398

66

356

44K

Willie Neiswanger

@willieneis

about 1 year ago

@krandiash Congrats!

0

1

0

143

Willie Neiswanger

@willieneis

over 1 year ago

@volokuleshov Congrats!

1

0

433

Willie Neiswanger

@willieneis

over 1 year ago

@StefanoErmon Congrats!

0

1

0

425

Willie Neiswanger

@willieneis

over 1 year ago

@adityagrover_ Congrats!

0

282

Willie Neiswanger

@willieneis

over 1 year ago

An awesome set of resources on LLM reasoning and test-time compute, compiled by @UpUpWang — check it out!

Shangshang Wang @UpupWang

over 1 year ago

🔍 Diving deep into LLM reasoning? From OpenAI's o-series to DeepSeek R1, from post-training to test-time compute — we break it down into structured spreadsheets. 🧵 https://t.co/4SoLCGUb3d

1

20

4

16

4K

0

10

0

5

1K

willieneis retweeted

Jiarui Zhang (Jerry)

@JiaruiZ58876329

over 1 year ago

[1/11] Many recent studies have shown that current multimodal LLMs (MLLMs) struggle with low-level visual perception (LLVP) — the ability to precisely describe the fine-grained/geometric details of an image. How can we do better? Introducing Euclid, our first study at improving MLLM’s LLVP. We show that with proper architecture & training choices, even small MLLMs can learn strong and generalizable LLVP, surpassing the best proprietary models!

1

20

5

6

3K

Willie Neiswanger

@willieneis

over 1 year ago

@ben_lengerich @USC @PrimeIntellect Thanks Ben!

0

109

Willie Neiswanger

@willieneis

over 1 year ago

Excited to release METAGENE-1, a 7B parameter metagenomic foundation model, built to aid in pathogen detection & pandemic monitoring. Pretrained on 1.5 trillion base pairs of DNA/RNA sequenced from wastewater. A collab w/ @USC, @PrimeIntellect, & the Nucleic Acid Observatory. 🧵

4

117

23

40

13K

Willie Neiswanger

@willieneis

over 1 year ago

@anthonygitter @PrimeIntellect @tatta_bio Thanks for the links! Yes, these are great ideas for next steps.

0

1

0

95
