We are doing really cool hard tech at @trymirai, but until recently our social media feeds were full of LinkedIn-ish cringe. We decided to fix that and share more technical content.
I am currently working on our quantization pipeline, so here is a thread about LLM quantization.
@mattcassinelli @tylerangert Oh for sure some cool non-MLX stuff this year. When Google released their on-device base model + LoRA, I crossed my fingers Apple would do the same. And they did!
Am actually curious now if anyone has shipped an AFM LoRA.
This ButterflyQuant paper looks neat, but also a little sus:
- no code
- no comparison against its closest relative (SpinQuant)
A good test project for coding agents?
Incoming new coremltools looks like it has some nice bits:
- 8-bit input/output tensors (previously all 8-bit compute was kept internal)
- more than one input can use enumerated shapes (👀 ANE)
🐙: https://t.co/M6F5EjzFrS
📄: https://t.co/aaPSq8iCmq
(R₅ is a rotation matrix, so its transpose is its inverse and it naturally cancels out in Q@Kᵀ)
Turns out you don’t need R₅⁻¹ at all. 🫠 Fusing into Q and K is enough!
Cool paper from Qualcomm explains this and a few similar transforms.
No code in the paper, so gist proof👇
Liking the line of research where you multiply LLM weights by rotation matrices and the model still works.
Most do it in between layers, but you can also sneak one between Q/K and RoPE.
Extra parameters? None.
Useful? …Maybe.
Cool? I think so.
(See R₅ below.)
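A minimal sketch of why this works, in plain Python. It is not the paper's code or the gist. It assumes R₅ is built from 2×2 rotations acting on the same coordinate pairs that RoPE rotates (so the two commute), and it assumes the interleaved-pair RoPE layout. `rotate_pairs`, `rope`, and `r5_angles` are hypothetical names made up for illustration. In a real model the rotation would be fused into the Q and K projection weights. Here it is applied to the q/k vectors directly, which gives the same scores.

```python
import math
import random

def rot2(x, y, angle):
    """Rotate the 2-D point (x, y) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c

def rotate_pairs(vec, angles):
    """Block-diagonal rotation: one 2x2 rotation per adjacent coordinate pair."""
    out = []
    for i, angle in enumerate(angles):
        out.extend(rot2(vec[2 * i], vec[2 * i + 1], angle))
    return out

def rope(vec, pos, base=10000.0):
    """Interleaved-pair RoPE (assumed layout): pair i turns by pos * base**(-2i/d)."""
    d = len(vec)
    angles = [pos * base ** (-2 * i / d) for i in range(d // 2)]
    return rotate_pairs(vec, angles)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
d = 8  # toy head dimension
q = [random.gauss(0, 1) for _ in range(d)]
k = [random.gauss(0, 1) for _ in range(d)]

# Hypothetical R5: an arbitrary rotation angle for each RoPE pair.
r5_angles = [random.uniform(0, 2 * math.pi) for _ in range(d // 2)]

# Attention score without the extra rotation (query at position 5, key at 2).
baseline = dot(rope(q, 5), rope(k, 2))

# Same score with R5 applied to both q and k before RoPE, and no inverse anywhere.
rotated = dot(rope(rotate_pairs(q, r5_angles), 5),
              rope(rotate_pairs(k, r5_angles), 2))

assert abs(baseline - rotated) < 1e-9
```

The q and k vectors themselves change, but every pair picks up the same extra angle on both sides, so the dot product only sees the RoPE position difference. That is the sense in which no R₅⁻¹ is needed.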
Curious about the Apple Foundation Model architecture? I updated my netron fork to visualize the draft model*.
*they say it might differ from the real model but looks convincing to me
See for yourself:
1. Get the adapter training toolkit: https://t.co/FBoPMOyOFd
2. Clone: https://t.co/KBnvGSQh5I
3. Edit https://t.co/NwMlsQrn2a:
- delete all functions except the first
- rename it to: func main<ios18>(
4. Follow readme to start netron, and open the .mil