Bram Wasti @bwasti - Twitter Profile

Bram Wasti

@bwasti

about 2 months ago

@cHHillee 4. make more layers share global attn - yoco/shared kv


bwasti retweeted

Jack Zhang

@jcz42

2 months ago

We made Muon run up to 2x faster for free!

Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition.

Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs.

Gram Newton-Schulz is a drop-in replacement of Newton-Schulz for your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else.

This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
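A rough sketch of the idea in the tweet above, not the authors' code: it assumes the plain cubic Newton-Schulz update `X <- 1.5 X - 0.5 (X X^T) X` (Muon implementations typically use a tuned higher-order polynomial instead), float64 NumPy, and a wide input with fewer rows than columns. The function names `newton_schulz` and `gram_newton_schulz` are made up for illustration. Since each step is `X <- P X` with `P = 1.5 I - 0.5 G` and `G = X X^T`, the loop can track only the small square `G` and an accumulated factor `Q`, then touch the rectangular matrix once at the end.

```python
import numpy as np

def newton_schulz(X, steps=30):
    # Baseline: cubic Newton-Schulz iterating on the rectangular matrix itself.
    X = X / np.linalg.norm(X)  # Frobenius norm, so all singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def gram_newton_schulz(X, steps=30):
    # Same iteration, rewritten to work on the n x n Gram matrix G = X X^T.
    X0 = X / np.linalg.norm(X)
    n = X0.shape[0]
    I = np.eye(n)
    G = X0 @ X0.T      # small, square, symmetric
    Q = I.copy()       # accumulates the product of per-step polynomials
    for _ in range(steps):
        P = 1.5 * I - 0.5 * G   # per-step polynomial in G
        Q = P @ Q               # X_k = Q_k X_0
        G = P @ G @ P           # Gram matrix of the next iterate (P is symmetric)
    return Q @ X0               # single rectangular product at the end

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 16))
U_ref = newton_schulz(A)
U_gram = gram_newton_schulz(A)
print(np.max(np.abs(U_ref - U_gram)))
```

In exact arithmetic the two functions return the same matrix. The sketch says nothing about low-precision stability or symmetric GEMM kernels, which the tweet describes as the hard part.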


bwasti retweeted

Arthur Spirling @arthur_spirling

3 months ago

🚨BREAKING AI NEWS 🚨 A Cambridge study just dropped that PROVES you can exactly calculate the slopes of functions at an arbitrary point. This UNLOCKS gradient optimization that experts say is vital for AGI. Download our app for a daily AI digest delivered to your inbox


Bram Wasti

@bwasti

3 months ago

gemini gets it tho. 👏



Bram Wasti

@bwasti

3 months ago

pretty much no AI can classify a hand painted CIFAR-100 clock. a very under-explored dataset!


Bram Wasti

@bwasti

3 months ago

using vim to edit my prompt


Bram Wasti

@bwasti

4 months ago

when claude comes up with a root cause it already debunked three chat compressions ago


Bram Wasti

@bwasti

4 months ago

yep, definitely a big part of the challenge and the reason I'm starting with a fixed size font for now. I'm moving from pure frame prediction to delta prediction (over char-by-char rendering) which will only require learning next position/shape of the letters on top of the actual latent text structure. big issue has been finding a good loss to capture the low freq position data and the high freq font-shape data at once


Bram Wasti

@bwasti

5 months ago

new side project, stay tuned :)


bwasti retweeted

Kimi.ai @Kimi_Moonshot

5 months ago

🥝 Meet Kimi K2.5, Open-Source Visual Agentic Intelligence.

🔹 Global SOTA on Agentic Benchmarks: HLE full set (50.2%), BrowseComp (74.9%)
🔹 Open-source SOTA on Vision and Coding: MMMU Pro (78.5%), VideoMMMU (86.6%), SWE-bench Verified (76.8%)
🔹 Code with Taste: turn chats, images & videos into aesthetic websites with expressive motion.
🔹 Agent Swarm (Beta): self-directed agents working in parallel, at scale. Up to 100 sub-agents, 1,500 tool calls, 4.5× faster compared with single-agent setup.
-
🥝 K2.5 is now live on https://t.co/YutVbwktG0 in chat mode and agent mode.
🥝 K2.5 Agent Swarm in beta for high-tier users.
🥝 For production-grade coding, you can pair K2.5 with Kimi Code: https://t.co/A5WQozJF3s
-
🔗 API: https://t.co/EOZkbOwCN4
🔗 Tech blog: https://t.co/6h2KkoA0xd
🔗 Weights & code: https://t.co/H38KegeDIY


Bram Wasti

@bwasti

5 months ago

@rustyryan still playing with the exact setup, but directly over pixels means an output dim of 16k which isn’t terrible at first glance…


Bram Wasti

@bwasti

5 months ago

ive found a key to complex systems dev with AI is iterating heavily on a logging system as a first class citizen. that lets the AI reconcile complex logic against what actually happens (races, slowdowns, deadlocks etc). stream based designs lend themselves well to that (https://t.co/jUVsh1KfwN)


Bram Wasti

@bwasti

5 months ago

contributing to a short talk at the vllm office hours! https://t.co/cAAakJAqeU


Bram Wasti

@bwasti

5 months ago

would be so nice to have claude venvs. `source my_claude_env/activate` without having to check your dir for the configs/mds cc @bcherny


bwasti retweeted

tender

@tenderizzation

5 months ago

prefill with a single token be like


Bram Wasti

@bwasti

5 months ago

@drisspg @typedfemale how does it compare to nick + sons?


bwasti retweeted

Anand Gopalakrishnan @agopal42

6 months ago

Our new paper shows that RoPE—the positional encoding used in most modern LLMs like Qwen, Gemma, DeepSeek—has a fundamental flaw: it entangles "what" (content) and "where" (position) information. Our fix (PoPE) is simple but powerful. Paper: https://t.co/XlltfcSwHQ


Bram Wasti

@bwasti

6 months ago

@CSProfKGD I remember implementing HOG features in SIMD (SSE2!) and then again in pure javascript (before web assembly)


Bram Wasti

@bwasti

6 months ago

do we really need things like garbage collection in an AI-native world? how much of the overhead in high-level languages can just disappear with vibe-coded stacks? zig is looking more appealing every day


Bram Wasti

@bwasti

6 months ago

github:bwasti/binfer -- an experiment with fast inference serving using bun + cuda. trying to focus on speed / UX experiments, like language overhead and startup time. not trying to maintain this as a legit framework (use vllm or sglang)

