We made Muon run up to 2x faster for free!
Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition.
Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs.
Gram Newton-Schulz is a drop-in replacement for Newton-Schulz in your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else.
This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
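To make the Gram rewrite described above concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the released implementation: it uses the textbook cubic Newton-Schulz step (X ← 1.5·X − 0.5·XXᵀX) rather than the tuned coefficients Muon uses, it runs in float64, and it includes none of the stabilization work mentioned in the post. The function names `newton_schulz` and `gram_newton_schulz` are hypothetical.

```python
import numpy as np


def newton_schulz(X, steps=5):
    """Textbook cubic Newton-Schulz, iterating on the rectangular X (m x n, m <= n)."""
    # The Frobenius norm bounds the spectral norm, so all singular values end up <= 1.
    X = X / np.linalg.norm(X)
    for _ in range(steps):
        G = X @ X.T                  # m x m Gram matrix of the current iterate
        X = 1.5 * X - 0.5 * (G @ X)  # rectangular (m x m) @ (m x n) product every step
    return X


def gram_newton_schulz(X, steps=5):
    """Same iteration, but the loop only touches m x m symmetric matrices.

    Writing X_k = P_k @ X_0 and G_k = X_k @ X_k.T, one cubic step is
        Q_k     = 1.5*I - 0.5*G_k
        P_{k+1} = Q_k @ P_k
        G_{k+1} = Q_k @ G_k @ Q_k   (Q_k is symmetric)
    so X_0 is only needed once at the start and once at the end.
    """
    X0 = X / np.linalg.norm(X)
    m = X0.shape[0]
    G = X0 @ X0.T   # Gram matrix, computed once from the rectangular input
    P = np.eye(m)   # accumulated polynomial in G_0, starts as the identity
    for _ in range(steps):
        Q = 1.5 * np.eye(m) - 0.5 * G
        P = Q @ P
        G = Q @ G @ Q
    return P @ X0   # single rectangular product at the end


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 8))
    diff = np.abs(newton_schulz(X) - gram_newton_schulz(X)).max()
    print("max abs difference between the two variants:", diff)
```

In exact arithmetic the two functions return the same matrix. In low precision the Gram form can behave differently, which is presumably where the stabilization effort described in the post comes in; this sketch does not attempt to address that.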
Had a great time working with the ARC Prize team testing r1!
r1 seems to get o1-preview perf and is cheap too — another great OSS model in the ecosystem
Thanks to @mikeknoop, @GregKamradt, and @ishanit5 for this opportunity!
It's time to revisit common assumptions in IR! Embeddings have improved drastically, but mainstream IR evals have stagnated since MSMARCO and BEIR.
We ask: on private or tricky IR tasks, are current rerankers even better? Surely, reranking as many docs as you can afford is best?
Want to train inference-friendly models that use less memory and have higher throughput? We show that KV cache sharing between layers and adding sliding window layers can speed up inference while maintaining model quality. https://t.co/x83VqACq2h
Excited to announce our new work: Critique-out-Loud (CLoud) reward models. CLoud reward models first produce a chain of thought critique of the input before predicting a scalar reward, allowing reward models to reason explicitly instead of implicitly!
https://t.co/CnYEDM36no
Pretraining data ablations are expensive: how can we measure data quality fast and cheap?
If you're at ICML, come find out at the ES-FoMo poster session today in Lehar 2 at 1 pm: https://t.co/pAMIWU7n0S
I’m going to be at ICML 2024 presenting a workshop paper on 3D video modeling! Down to talk about anything LLMs, multimodality / new CV modalities, and ML systems!
We are excited to announce Vid3D, a technique for generating 3D video using only 2D video diffusion models and Gaussian splatting!
Paper: https://t.co/RnbnyRZHJU
Github: https://t.co/ZmYJEe6hOb
Project Page: https://t.co/gYQXnb9xkX