Cafe Cursor Colombo 🇱🇰
Wrapped up in style!
Colombo was filled with builders on Saturday!
See you all again at another Cursor event!
Thank you @benln & @ftnabeelah & @cursor_ai for all the immense support!
I have been fine-tuning LLMs for over 2 years now!
Here are the top 5 LLM fine-tuning techniques, explained with visuals:
First of all, what's so different about LLM finetuning?
Traditional full fine-tuning is impractical for LLMs: it updates billions of parameters, and the models themselves are 100s of GB in size.
Since this kind of compute isn't accessible to everyone, parameter-efficient finetuning (PEFT) came into existence.
Before we go into details of each technique, here's some background that will help you better understand these techniques:
LLM weights are matrices of numbers adjusted during finetuning.
Most PEFT techniques involve finding a lower-rank adaptation of these matrices: a smaller-dimensional matrix that can still represent the information stored in the original. (The rank of a matrix is the number of linearly independent rows or columns it has.)
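To make the low-rank idea concrete, here is a minimal NumPy sketch. The sizes d and r are hypothetical, chosen only for illustration. It builds a full-size matrix from two thin factors and compares how many numbers each representation stores.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 512, 8  # hypothetical sizes: a d x d weight matrix and a rank-r factorization

# Two thin factors whose product is a full-size d x d matrix of rank at most r.
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
low_rank = B @ A

full_params = d * d        # numbers stored by the full matrix
factor_params = 2 * d * r  # numbers stored by the two factors

print(low_rank.shape)                   # (512, 512)
print(np.linalg.matrix_rank(low_rank))  # at most r
print(full_params, factor_params)       # 262144 8192
```

The product has the same shape as a full weight matrix, but the two factors store 32x fewer numbers at these sizes.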
Now with a basic understanding of the rank of a matrix, we're in a good position to understand the different finetuning techniques.
(refer to the image below for a visual explanation of each technique)
1) LoRA
- Add two low-rank trainable matrices, A and B, alongside each weight matrix W.
- Instead of fine-tuning W, keep it frozen and train only A and B, whose product represents the update to W.
Even for the largest of LLMs, LoRA matrices take up a few MBs of memory.
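A minimal NumPy sketch of the LoRA idea described above, not a full training setup. The layer sizes and rank are hypothetical, and the init (small random A, zero B) is the common convention, assumed here so the adapted layer starts out identical to the pretrained one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4  # hypothetical layer sizes and LoRA rank

W = rng.standard_normal((d_out, d_in))     # pretrained weight, kept frozen
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init so B @ A starts at 0

def lora_forward(x):
    # Base output plus the low-rank update; W itself is never modified.
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
same_at_start = np.allclose(lora_forward(x), W @ x)

trainable = A.size + B.size  # r * (d_in + d_out)
frozen = W.size              # d_out * d_in

print(same_at_start, trainable, frozen)
```

Only A and B would receive gradients during training, so the number of trainable values scales with r * (d_in + d_out) rather than d_in * d_out.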
2) LoRA-FA
While LoRA significantly decreases the total trainable parameters, it requires substantial activation memory to update the low-rank weights.
LoRA-FA (FA stands for Frozen-A) freezes matrix A and only updates matrix B.
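A rough NumPy sketch of the frozen-A idea, assuming a toy squared-error loss on a single input (the loss, sizes, and learning rate are illustrative assumptions, not from the LoRA-FA authors). With A fixed, the gradient for B only needs the small r-dimensional activation A @ x.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 2  # hypothetical sizes

W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # frozen after random init (the "FA" part)
B = np.zeros((d_out, r))                # the only trainable matrix

x = rng.standard_normal(d_in)
y = rng.standard_normal(d_out)          # toy regression target

def loss(B):
    err = W @ x + B @ (A @ x) - y
    return 0.5 * float(err @ err)

A_before = A.copy()
loss_before = loss(B)

lr = 1e-3
for _ in range(100):
    z = A @ x                  # r-dimensional activation: all that B's gradient needs
    err = W @ x + B @ z - y
    grad_B = np.outer(err, z)  # dL/dB for the squared-error loss above
    B -= lr * grad_B           # A is never touched

loss_after = loss(B)
print(loss_before, loss_after)
```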
3) VeRA
- In LoRA, low-rank matrices A and B are unique for each layer.
- In VeRA, A and B are frozen, random, and shared across all layers.
- Instead, it learns small, layer-specific scaling VECTORS (b and d).
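The three points above can be sketched in NumPy as follows. The sizes, the number of layers, and the initial values of b and d are assumptions for illustration; here b starts at zero so each layer initially matches its pretrained output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, n_layers = 32, 32, 8, 3  # hypothetical sizes

# One frozen random pair (A, B), shared by every layer.
A = rng.standard_normal((r, d_in))
B = rng.standard_normal((d_out, r))

# Per-layer frozen pretrained weights and per-layer trainable scaling vectors.
Ws = [rng.standard_normal((d_out, d_in)) for _ in range(n_layers)]
ds = [np.full(r, 0.1) for _ in range(n_layers)]  # scales the r rows of A
bs = [np.zeros(d_out) for _ in range(n_layers)]  # scales the d_out rows of B

def vera_layer(x, W, b, d):
    # h = W x + diag(b) B diag(d) A x, written with elementwise products
    return W @ x + b * (B @ (d * (A @ x)))

x = rng.standard_normal(d_in)
h = x
for W, b, d in zip(Ws, bs, ds):
    h = vera_layer(h, W, b, d)

trainable_per_layer = r + d_out      # VeRA: two vectors
lora_per_layer = r * (d_in + d_out)  # LoRA at the same rank, for comparison

print(trainable_per_layer, lora_per_layer)  # 40 512
```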
4) Delta-LoRA
- It tunes the matrix W as well, but not in the traditional way.
- Here, the difference (or delta) between the product of matrices A and B in two consecutive training steps is added to W.
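A minimal NumPy sketch of that update rule, under several assumptions: a toy squared-error loss, plain gradient steps on A and B, a hypothetical scale factor lam on the delta, and the product written as B @ A. It is meant to show the shape of the idea, not the exact recipe from the Delta-LoRA authors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 2  # hypothetical sizes
lr, lam = 1e-3, 0.5         # lam: hypothetical scale applied to the delta added to W

W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.1
B = rng.standard_normal((d_out, r)) * 0.1
x = rng.standard_normal(d_in)
y = rng.standard_normal(d_out)

W_start = W.copy()
start_product = B @ A
for _ in range(5):
    prev_product = B @ A             # product at step t
    err = (W + B @ A) @ x - y        # toy squared-error residual
    grad_B = np.outer(err, A @ x)
    grad_A = np.outer(B.T @ err, x)
    A = A - lr * grad_A              # ordinary LoRA step on A and B
    B = B - lr * grad_B
    delta = B @ A - prev_product     # product at step t+1 minus product at step t
    W = W + lam * delta              # W moves without its own gradient or optimizer state

w_changed = not np.allclose(W, W_start)
print(w_changed)
```

Because the per-step deltas telescope, the total change in W equals lam times the total change in the product B @ A over training.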
5) LoRA+
- In LoRA, both matrices A and B are updated with the same learning rate.
- Authors of LoRA+ found that setting a higher learning rate for matrix B results in better convergence.
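In code, this amounts to using two learning rates. Below is a small NumPy sketch with a hypothetical helper, lora_plus_step; the ratio of 16 and the base learning rate are illustrative assumptions to be tuned, not prescribed values.

```python
import numpy as np

def lora_plus_step(A, B, grad_A, grad_B, lr_A=1e-4, ratio=16.0):
    """One SGD step where B's learning rate is `ratio` times A's.

    `ratio` is a hypothetical default for illustration; treat it as a
    hyperparameter to tune.
    """
    lr_B = ratio * lr_A
    return A - lr_A * grad_A, B - lr_B * grad_B

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 32))
B = np.zeros((64, 4))
grad_A = np.ones_like(A)  # placeholder gradients, just to compare step sizes
grad_B = np.ones_like(B)

A_new, B_new = lora_plus_step(A, B, grad_A, grad_B)
step_A = np.abs(A_new - A).max()
step_B = np.abs(B_new - B).max()
print(step_B / step_A)  # about 16
```

In a framework with optimizer parameter groups, the same effect would typically be achieved by putting A and B in separate groups with different learning rates.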
____
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
This interesting week started with DeepSeek V3.2!
I just wrote up a technical tour of the predecessors and components that led up to this:
🔗 https://t.co/JSAd9cx2s6
- Multi-Head Latent Attention
- RLVR
- Sparse Attention
- Self-Verification
- GRPO Updates
@minchoi Isn't it Laixi Screen Monitoring Software? It's nothing that new. Maybe it's a little more intelligent at doing tasks now, who knows, but this is really old.
@karpathy Completed. Just wow. Well, it's God damn freaking Andrej Karpathy after all. Thank you so much. Can't remember watching a video longer than 45 min on YouTube before.