William Hu @_williamhu - Twitter Profile

Pinned Tweet

7 months ago

AI is compute-hungry. While it has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet, the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to AMD? We’re excited to introduce the newest addition to the ThunderKittens cinematic universe of kernel DSLs: HipKittens (HK) 🚀 for Fast and Furious AMD kernels.

7

157

35

29

53K

William Hu

@_williamhu

about 2 months ago

@levidiamode @HazyResearch Let me know if you have any questions!

1

6

0

268

_williamhu retweeted

Stuart Sul

@stuart_sul

2 months ago

Happy to share new ThunderKittens attention kernels for B300 GPUs -- faster than FA4! Check it out:

2

151

13

34

15K

William Hu

@_williamhu

3 months ago

Modal's been an amazing abstraction to work with! Excited to be building it out more! 🚀

Charles 🎉 Frye

@charles_irl

3 months ago

Fresh blog post! @modal partnered with @ScalingIntelLab, @HazyResearch, and @chelseabfinn's IRIS Lab to speed up research on speeding up AI research. Read how scientists at the cutting edge are building the machines that build the machines with Modal. https://t.co/FDbnp8DOuC

5

186

18

124

51K

2

11

2

0

2K

_williamhu retweeted

Flapping Airplanes

@flappyairplanes

4 months ago

Announcing Flapping Airplanes! We’ve raised $180M from GV, Sequoia, and Index to assemble a new guard in AI: one that imagines a world where models can think at human level without ingesting half the internet.

338

4K

257

1K

2M

_williamhu retweeted

alex zhang

@a1zhang

5 months ago

Much like the switch in 2025 from language models to reasoning models, we think 2026 will be all about the switch to Recursive Language Models (RLMs). It turns out that models can be far more powerful if you allow them to treat *their own prompts* as an object in an external environment, which they understand and manipulate by writing code that invokes LLMs! Our full paper on RLMs is now available—with much more expansive experiments compared to our initial blogpost from October 2025! https://t.co/x47pIfIkTb

251

7K

1K

7K

2M

_williamhu retweeted

Owen Dugan

@OwenDugan

6 months ago

Happy 🦃 Thanksgiving weekend! 🍂 This year, we cooked up a new recipe for juicy fact-storing MLPs. Instead of picking apart trained models, we asked: Can we construct fact-storing MLPs from scratch? 🤔 Spoiler: we can & we figured out how to slot these hand-crafted MLPs into Transformer blocks as modular fact stores! 🧩 New work with @garctrob @ronnygjunkins @jerrywliu @dylan_zinsley @EyubogluSabri Atri Rudra @HazyResearch! 🧵👇

8

339

47

246

65K

_williamhu retweeted

Simran Arora

@simran_s_arora

7 months ago

Super excited for ParallelKittens led by @stuart_sul! From the Nvidia A100 to the B200, BF16 tensor core performance improved by 7.2× and High Bandwidth Memory bandwidth by 5.1×, while intra-node communication (NVLink) improved by only 3× and inter-node (PCIe/InfiniBand) by just 2×. PK helps with multi-GPU kernels!

4

192

22

88

33K

_williamhu retweeted

Stuart Sul

@stuart_sul

7 months ago

(1/6) GPU networking is the remaining AI efficiency bottleneck, and the underlying hardware is changing fast! We’re happy to release ParallelKittens, an update to ThunderKittens that lets you easily write fast computation-communication overlapped multi-GPU kernels, along with new kernels for data, tensor, sequence, and expert parallelism! Here’s a photo of overlapped kittens, along with things you should care about when optimizing multi-GPU kernels. (With @simran_s_arora, @bfspector, and @hazyresearch. Generously supported by @cursor_ai and @togethercompute)

9

514

59

505

156K

William Hu

@_williamhu

7 months ago

@hhua_ Excited to keep growing HK! Doing some exploration on my end too :)

0

1

0

1

94

William Hu

@_williamhu

7 months ago

AI is compute-hungry. While it has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet, the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to AMD? We’re excited to introduce the newest addition to the ThunderKittens cinematic universe of kernel DSLs: HipKittens (HK) 🚀 for Fast and Furious AMD kernels.

7

157

35

29

53K

William Hu

@_williamhu

7 months ago

👀

Simran Arora

@simran_s_arora

7 months ago

exciting! https://t.co/qKNB4CWU8H

3

333

18

121

55K

0

14

0

2K

_williamhu retweeted

the tiny corp

@__tinygrad__

7 months ago

A great deep dive into CDNA4 here. https://t.co/npw7mIcur4

1

51

2

21

8K

William Hu

@_williamhu

7 months ago

@TDevilfish Dynamic reg alloc will probably enable wave specialization to be more of a thing then! Curious about how the matrix layouts and memory access patterns for wave32 mode will differ too.

0

1

0

49

William Hu

@_williamhu

7 months ago

AI is multi-silicon 🚀

AI at AMD

@AIatAMD

7 months ago

AI is compute hungry, so the @HazyResearch team at @Stanford asked: How do we build AI from the hardware up? How do we lead developers to do what the hardware prefers? This technical deep dive on HipKittens explores how optimized register tiles, wave-level scheduling, and chiplet-aware cache reuse help unlock the full potential of AMD GPUs. 🐱 Dig into the details: https://t.co/lw4cw4YZgh #AMDevs

0

68

7

26

14K

0

22

2

4

4K

_williamhu retweeted

Simran Arora

@simran_s_arora

7 months ago

We had a great time previewing HipKittens at AMD Dev Day a few weeks ago!!

7

215

8

21

27K

_williamhu retweeted

AMDGPU

@AMDGPU_

7 months ago

Computer scientists at Stanford University achieve breakthrough AI performance on AMD's MI355X GPUs.

2

125

11

38

14K

_williamhu retweeted

AI at AMD

@AIatAMD

7 months ago

HipKittens is here. A new stack of fast, readable AMD GPU kernels built for real performance and real developer velocity. Check it out on the @HazyResearch blog: https://t.co/zoeAB7Ujfs #AMDevs

0

55

7

12K

William Hu

@_williamhu

7 months ago

When your ArXiv submission goes from on hold to accepted [https://t.co/0Nd01UB1do]. There’s a pun at the end of the introduction if anyone spots it 👀

2

18

3

1

2K

_williamhu retweeted

Jon Saad-Falcon

@JonSaadFalcon

7 months ago

Data centers dominate AI, but they're hitting physical limits. What if the future of AI isn't just bigger data centers, but local intelligence in our hands? The viability of local AI depends on intelligence efficiency. To measure this, we propose intelligence per watt (IPW): intelligence delivered (capabilities) per unit of power consumed (efficiency). Today’s Local LMs already handle 88.7% of single-turn chat and reasoning queries, with local IPW improving 5.3× in 2 years—driven by better models (3.2×) and better accelerators (1.7×). As local IPW improves, a meaningful fraction of workloads can shift from centralized infrastructure to local compute, with IPW serving as the critical metric for tracking this transition. (1/N)

56

462

142

178

229K

_williamhu retweeted

SemiAnalysis

@SemiAnalysis_

7 months ago

Stanford used to be an NVIDIA stronghold; it even has a building named after Jensen, the Jen-Hsun Huang Engineering Center. But it seems AMD is starting to gain traction in Stanford’s research labs now, with experimental ROCm support in ThunderKittens. NVIDIA will need to contribute more compute to Stanford research if it wants Stanford to remain an NVIDIA stronghold.

14

445

31

130

164K
