Tina showed that LoRA can match or surpass full-parameter RL. Tora builds directly on that result, turning it into a full framework.
Built on torchtune, it extends RL post-training to LoRA, QLoRA, DoRA, and QDoRA under one interface with GRPO, FSDP, and compile support. QLoRA and QDoRA enable 4-bit RL with stable rewards, while DoRA-Cache speeds rollouts by 2–4× under the same setup.
Tora establishes a clean, scalable baseline for LoRA in RL post-training.
⮕ 𝐥𝐢𝐧𝐤 𝐛𝐞𝐥𝐨𝐰
Announcing 🔭✨Hubble, a suite of open-source LLMs to advance the study of memorization!
Pretrained models up to 8B params, with controlled insertion of texts (e.g., book passages, biographies, test sets, and more!) designed to emulate key memorization risks 🧵
It was great to see @thinkymachines LoRA w/o Regret blog, which connects nicely to our work on Tina (LoRA for RL).
For wider use, we’re releasing a clean implementation of RL with LoRA, DoRA, QLoRA/QDoRA, plus speedups & more, across models from 1.5B–32B.
Nice work @UpupWang!
We now know that LoRA can match full-parameter RL training (from https://t.co/pGxoMLFIGv and our Tina paper https://t.co/dkXdxV3eNj), but what about DoRA, QLoRA, and more?
We are releasing a clean LoRA-for-RL repo to explore them all.
https://t.co/AsWWG1kmKt
Sparse autoencoders (SAEs) can be used to elicit strong reasoning abilities with remarkable efficiency.
Using only 1 hour of training at $2 cost without any reasoning traces, we find a way to train 1.5B models via SAEs to score 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23.
Textual steering vectors can improve visual understanding in multimodal LLMs!
You can extract steering vectors via any interpretability toolkit you like -- SAEs, MeanShift, Probes -- and apply them to image or text tokens (or both) of Multimodal LLMs.
And They Steer!
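A minimal sketch of the MeanShift option, assuming it refers to the usual difference-of-means recipe: average the activations for texts that express a concept, subtract the average for texts that do not, and add the resulting direction to the hidden states of the chosen tokens. The function names, the toy 2-dimensional activations, and the `strength` parameter are all hypothetical; a real setup would read activations from a specific layer of the model.

```python
# Difference-of-means ("mean shift") steering, sketched in plain Python.
# All names and numbers are illustrative assumptions, not any library's API.

def mean_vector(vectors):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def mean_shift_vector(with_concept, without_concept):
    """Steering direction = mean(with concept) - mean(without concept)."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def steer(hidden_states, vector, strength=1.0):
    """Add the scaled steering vector to every selected token's hidden state."""
    return [[h + strength * v for h, v in zip(token, vector)]
            for token in hidden_states]

# Toy activations: two texts with the concept, two without.
with_concept = [[1.0, 0.0], [3.0, 2.0]]
without_concept = [[0.0, 0.0], [0.0, 2.0]]
direction = mean_shift_vector(with_concept, without_concept)

# Apply the direction to two tokens (these could be image or text tokens).
steered = steer([[0.0, 0.0], [1.0, 1.0]], direction, strength=0.5)
```

Which tokens get passed to `steer` is the design choice the post points at: the same vector can be added to image tokens, text tokens, or both.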
Is LoRA (Low Rank Adaptation) relevant in 2025 for reasoning models?
I recently read "Tina: Tiny Reasoning Models via LoRA (https://t.co/rIlj7amWd4)", and it made me pause for a moment: when was the last time I heard someone excitedly talk/write about LoRA?
LoRA (Low-Rank Adaptation) was one of the most influential fine-tuning methods in the earlier LLM boom (as you may remember, I wrote about it a lot in recent years). The idea is simple but effective: avoid full model updates and instead inject a small number of trainable parameters for downstream tasks. This drastically reduces memory and compute costs. But in the age of ever-larger instruction-tuned models coupled with well-working distillation techniques (as popularized by DeepSeek-R1 and others), LoRA has seemed less relevant recently.
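To make the savings concrete, here is a minimal plain-Python sketch of the LoRA idea: the pretrained weight W stays frozen and only two small factors, A and B, are trained, so the effective weight is W + (alpha / r) * B A. The matrix sizes, the rank, and the helper names are illustrative assumptions, not taken from any particular library.

```python
# Minimal LoRA sketch in plain Python (no deep-learning library assumed).
# All names and sizes are hypothetical, chosen only for illustration.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the LoRA rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Trainable parameter counts for one weight matrix: full vs. LoRA."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}

w = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weight
a = [[0.5, -0.5]]              # shape (1, 2), so rank r = 1

# B is initialized to zero, so the adapted layer starts out equal to the base.
b_zero = [[0.0], [0.0]]
unchanged = lora_effective_weight(w, a, b_zero, alpha=2.0)

# After training, B is nonzero and the update is a rank-1 correction to W.
b_trained = [[1.0], [2.0]]
adapted = lora_effective_weight(w, a, b_trained, alpha=2.0)

counts = trainable_params(d_out=4096, d_in=4096, r=16)
```

For a hypothetical 4096 x 4096 projection at rank 16, `counts` comes out to 16,777,216 trainable parameters for a full update versus 131,072 for LoRA, a 128x reduction for that one matrix.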
What about LoRA for developing reasoning models?
This paper tackles exactly that question. Instead of the usual supervised fine-tuning or instruction distillation pipeline, the authors use LoRA with reinforcement learning (RL) to improve reasoning capabilities. Specifically, they fine-tune a 1.5B base model using LoRA adapters while applying RL on reasoning benchmarks.
Their baseline model is DeepSeek-R1-Distill-Qwen-1.5B, which is a model already fine-tuned for reasoning tasks. (I wish they had started with the base Qwen-1.5B model; but this way, I guess they have more comparisons with other methods that further trained DeepSeek-R1-Distill-Qwen-1.5B.)
From there, the authors ran experiments across datasets, learning rates, LoRA ranks, and RL algorithms. Their best-performing model was trained on just 7k examples and cost $9 to train. Even with hyperparameter sweeps and multiple ablations, the entire study cost only $526.
So, how well does LoRA work?
The top half of the results figure (highlighted in blue) compares models trained with LoRA-based RL versus standard RL (i.e., no LoRA). On every benchmark (AIME24, AIME25, AMC23, MATH500, GPQA, Minerva), LoRA outperforms the regular RL baseline when applied to the same starting model.
Insights from ablations
1) Surprisingly, the best-performing model came from the smallest dataset: just 7k examples from Open-RS.
2) The classic LoRA rank 16 emerged as the sweet spot, but ranks 8 and 32 also worked well.
3) It's nice that they included the recent Dr. GRPO (I recently discussed it in my latest Ahead of AI blog). It substantially reduces training time by removing the length and standard-deviation normalization terms that bias GRPO's updates.
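As a rough sketch of how the two algorithms in point 3 differ, assuming the formulations in the GRPO and Dr. GRPO papers as commonly described: GRPO standardizes each reward within its group of sampled responses, while Dr. GRPO only subtracts the group mean (it also drops the per-response length division in the loss, which is not shown here). The function names, the example rewards, and the use of a population standard deviation are assumptions for illustration.

```python
# Group-relative advantages: GRPO vs. Dr. GRPO (illustrative sketch only).
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style: subtract the group mean, then divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    """Dr. GRPO-style: subtract the group mean only, with no std scaling."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

# One prompt, four sampled responses, binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
grpo = grpo_advantages(rewards)
dr_grpo = dr_grpo_advantages(rewards)
```

With these toy rewards, the Dr. GRPO advantages are simply +0.5 and -0.5, while the GRPO version rescales them to roughly +1 and -1 regardless of how spread out the group's rewards were.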
Bottom line:
Reasoning is certainly an interesting use case, and it's a bit surprising that LoRA does so well here. It might also be the first time I've seen LoRA coupled with RL, which is another noteworthy aspect.
While LoRA certainly peaked in popularity 1-2 years ago, and more people now consider (more expensive) full-parameter updates (based on my anecdotal perception), there's still a place for LoRA and LoRA-like methods.
Let's not forget that one of the key advantages of LoRA is that it doesn't modify the underlying base model. This is key in applications where you either have lots of specialized use cases or lots of customers. For example, instead of storing 100 full-parameter-tuned 32B models, it would be much cheaper to store one 32B model with 100 sets of LoRA weights.
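A back-of-the-envelope sketch of that storage argument, under loudly stated assumptions: 16-bit weights, a 32B-parameter base model, and a hypothetical adapter size of 0.5% of the base parameters per LoRA set (the real fraction depends on the rank and on which layers are adapted).

```python
# Storage comparison: many full fine-tuned copies vs. one base + many adapters.
# Every constant below is an assumption for illustration.

BYTES_PER_PARAM = 2  # assumption: 16-bit (bf16/fp16) weights

def full_copies_gb(base_params, n_variants):
    """Storage if every variant is a separately stored full-parameter model."""
    return n_variants * base_params * BYTES_PER_PARAM / 1e9

def shared_base_gb(base_params, n_variants, adapter_fraction):
    """Storage for one shared base model plus one LoRA adapter per variant."""
    adapter_params = base_params * adapter_fraction
    total_params = base_params + n_variants * adapter_params
    return total_params * BYTES_PER_PARAM / 1e9

full_gb = full_copies_gb(32e9, 100)
lora_gb = shared_base_gb(32e9, 100, adapter_fraction=0.005)
savings_factor = full_gb / lora_gb
```

Under these assumptions, 100 full copies need about 6,400 GB, while one base model plus 100 adapters needs about 96 GB.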
Tina: Tiny Reasoning Models via LoRA
"the best Tina model achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA."
😋 Want strong LLM reasoning without breaking the bank? We explored just how cost-effectively RL can enhance reasoning using LoRA!
[1/9] Introducing Tina: A family of tiny reasoning models with strong performance at low cost, providing an accessible testbed for RL reasoning. 🧵
🔍 Diving deep into LLM reasoning?
From OpenAI's o-series to DeepSeek R1, from post-training to test-time compute — we break it down into structured spreadsheets. 🧵
[1/11] Many recent studies have shown that current multimodal LLMs (MLLMs) struggle with low-level visual perception (LLVP) — the ability to precisely describe the fine-grained/geometric details of an image.
How can we do better?
Introducing Euclid, our first study on improving MLLMs' LLVP. We show that with proper architecture & training choices, even small MLLMs can learn strong and generalizable LLVP, surpassing the best proprietary models!
Excited to release METAGENE-1, a 7B parameter metagenomic foundation model, built to aid in pathogen detection & pandemic monitoring. Pretrained on 1.5 trillion base pairs of DNA/RNA sequenced from wastewater.
A collab w/ @USC, @PrimeIntellect, & the Nucleic Acid Observatory. 🧵