We made Muon run up to 2x faster for free!
Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition.
Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs.
Gram Newton-Schulz is a drop-in replacement for Newton-Schulz in your Muon use case: we see validation perplexity preserved within 0.01. We also share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else.
This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
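A rough sketch of the core idea in NumPy, not the released codebase. It assumes the plain cubic Newton-Schulz step X <- 1.5 X - 0.5 (X X^T) X rather than the tuned coefficients Muon uses, and it skips the stabilization work mentioned above. The function names `newton_schulz` and `gram_newton_schulz` are made up for illustration. Each step can be written as X <- P X with P = 1.5 I - 0.5 G and G = X X^T, so G itself updates as G <- P G P and the rectangular X is only touched at the start and the end.

```python
import numpy as np

def newton_schulz(X, steps=5):
    """Reference: plain cubic Newton-Schulz on the rectangular matrix."""
    X = X / np.linalg.norm(X)          # Frobenius norm, so singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def gram_newton_schulz(X, steps=5):
    """Same iteration, but the loop only touches m x m matrices (assumes m <= n)."""
    X = X / np.linalg.norm(X)
    m = X.shape[0]
    I = np.eye(m)
    G = X @ X.T                        # small symmetric Gram matrix
    Q = np.eye(m)                      # accumulated product of per-step factors
    for _ in range(steps):
        P = 1.5 * I - 0.5 * G          # symmetric, a polynomial in G
        Q = P @ Q                      # X_k = Q @ X_0
        G = P @ G @ P                  # Gram matrix of the next iterate
    return Q @ X                       # one rectangular GEMM at the end

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 16))
print(np.max(np.abs(newton_schulz(X) - gram_newton_schulz(X))))
```

The two functions agree in exact arithmetic. In low precision the Gram form can behave differently, which is presumably where the stabilization effort described in the post comes in.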
🚨BREAKING AI NEWS 🚨
A Cambridge study just dropped that PROVES you can exactly calculate the slopes of functions at an arbitrary point. This UNLOCKS gradient optimization that experts say is vital for AGI.
Download our app for a daily AI digest delivered to your inbox
yep, definitely a big part of the challenge and the reason I'm starting with a fixed-size font for now. I'm moving from pure frame prediction to delta prediction (over char-by-char rendering) which will only require learning next position/shape of the letters on top of the actual latent text structure
big issue has been finding a good loss to capture the low freq position data and the high freq font-shape data at once
🥝 Meet Kimi K2.5, Open-Source Visual Agentic Intelligence.
🔹 Global SOTA on Agentic Benchmarks: HLE full set (50.2%), BrowseComp (74.9%)
🔹 Open-source SOTA on Vision and Coding: MMMU Pro (78.5%), VideoMMMU (86.6%), SWE-bench Verified (76.8%)
🔹 Code with Taste: turn chats, images & videos into aesthetic websites with expressive motion.
🔹 Agent Swarm (Beta): self-directed agents working in parallel, at scale. Up to 100 sub-agents, 1,500 tool calls, 4.5× faster compared with single-agent setup.
-
🥝 K2.5 is now live on https://t.co/YutVbwktG0 in chat mode and agent mode.
🥝 K2.5 Agent Swarm in beta for high-tier users.
🥝 For production-grade coding, you can pair K2.5 with Kimi Code: https://t.co/A5WQozJF3s
-
🔗 API: https://t.co/EOZkbOwCN4
🔗 Tech blog: https://t.co/6h2KkoA0xd
🔗 Weights & code: https://t.co/H38KegeDIY
i've found that a key to complex systems dev with AI is iterating heavily on a logging system as a first-class citizen. that lets the AI reconcile complex logic against what actually happens (races, slowdowns, deadlocks, etc.)
stream based designs lend themselves well to that (https://t.co/jUVsh1KfwN)
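a minimal sketch of what "logging as a first class citizen" could look like, assuming a single in-process event stream. the `EventLog` class and its field names are hypothetical, not taken from the linked project. each record gets a sequence number, a monotonic timestamp, and the thread name, so the ordering of events across threads can be reconstructed afterwards.

```python
import json
import threading
import time

class EventLog:
    """Append-only, thread-safe event stream that dumps to JSON lines."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []
        self._seq = 0

    def emit(self, event, **fields):
        # The lock makes seq a total order over all emitting threads.
        with self._lock:
            self._seq += 1
            record = {
                "seq": self._seq,
                "t": time.monotonic(),
                "thread": threading.current_thread().name,
                "event": event,
                **fields,
            }
            self._events.append(record)
        return record

    def dump(self):
        # One JSON object per line, easy to grep or paste into a prompt.
        with self._lock:
            return "\n".join(json.dumps(e) for e in self._events)

log = EventLog()
log.emit("lock_acquire", name="queue")
log.emit("lock_release", name="queue")
print(log.dump())
```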
Our new paper shows that RoPE—the positional encoding used in most modern LLMs like Qwen, Gemma, DeepSeek—has a fundamental flaw: it entangles "what" (content) and "where" (position) information.
Our fix (PoPE) is simple but powerful. Paper: https://t.co/XlltfcSwHQ
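For context, a minimal sketch of standard RoPE only (PoPE itself is not reproduced here, see the paper). It assumes the common "rotate-half" layout and a base of 10000, and the helper names `rope` and `rope_score` are made up for illustration. Treating each 2D pair of q and k as complex numbers, the pair's contribution to the attention score is |q_j||k_j| cos(arg q_j - arg k_j + (m - n) theta_j), so the content phases and the position offset land inside the same cosine. That is one way to read the "what" and "where" entanglement described above.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a 1D vector of even length."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # theta_j, one per 2D pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]                 # pair j is (x[j], x[j + half])
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

def rope_score(q, k, m, n):
    """Attention logit between query at position m and key at position n."""
    return float(rope(q, m) @ rope(k, n))

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# The score depends on positions only through the offset m - n ...
print(rope_score(q, k, 3, 7), rope_score(q, k, 10, 14))
# ... but that offset is mixed with the content of q and k in every pair.
```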
do we really need things like garbage collection in an AI-native world? how much of the overhead in high-level languages can just disappear with vibe-coded stacks?
zig is looking more appealing every day
github:bwasti/binfer -- an experiment with fast inference serving using bun + cuda.
trying to focus on speed / UX experiments, like language overhead and startup time. not trying to maintain this as a legit framework (use vllm or sglang)