levi

Verified account

@levidiamode

365 days of GPU programming ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░░░░░░░░░ 146/365

out of memory

Joined June 2025

647 Following

4.8K Followers

481 Posts

Pinned Tweet

2 months ago

Day 83/365 of GPU Programming Looking at DeepSeek's Multi-Head Latent Attention today. The last part of the AMD challenge series is to optimize an MLA decode kernel for MI355X where the absorbed Q and compressed KV cache are given and your task is to do the attention computation. A resource that really helped internalize what MLA does was @rasbt's incredible visual guide to attention variants in LLMs (luckily he posted that last week!), which covers everything from MHA to GQA to MLA to SWA, et cetera. If there's one place to get a visual intuition for recent attention mechanisms, it's this blog post. @jbhuang0604's video on MQA, GQA,MLA and DSA was the best conceptual intro I found on the topic and progressively builds up the ideas from first principles. The Welch Labs analysis of MLA is a great watch as well. Beautiful visualization of the changes DeepSeek made for MLA. Tried out a few kernels once I had a basic understanding of MLA and I think I'm slowly getting more comfortable with at least analyzing kernels.

levidiamode's tweet photo. Day 83/365 of GPU Programming

Looking at DeepSeek's Multi-Head Latent Attention today. The last part of the AMD challenge series is to optimize an MLA decode kernel for MI355X where the absorbed Q and compressed KV cache are given and your task is to do the attention computation.

A resource that really helped internalize what MLA does was @rasbt's incredible visual guide to attention variants in LLMs (luckily he posted that last week!), which covers everything from MHA to GQA to MLA to SWA, et cetera. If there's one place to get a visual intuition for recent attention mechanisms, it's this blog post.

@jbhuang0604's video on MQA, GQA,MLA and DSA was the best conceptual intro I found on the topic and progressively builds up the ideas from first principles.

The Welch Labs analysis of MLA is a great watch as well. Beautiful visualization of the changes DeepSeek made for MLA.

Tried out a few kernels once I had a basic understanding of MLA and I think I'm slowly getting more comfortable with at least analyzing kernels.

levidiamode's tweet photo. Day 83/365 of GPU Programming

Looking at DeepSeek's Multi-Head Latent Attention today. The last part of the AMD challenge series is to optimize an MLA decode kernel for MI355X where the absorbed Q and compressed KV cache are given and your task is to do the attention computation.

A resource that really helped internalize what MLA does was @rasbt's incredible visual guide to attention variants in LLMs (luckily he posted that last week!), which covers everything from MHA to GQA to MLA to SWA, et cetera. If there's one place to get a visual intuition for recent attention mechanisms, it's this blog post.

@jbhuang0604's video on MQA, GQA,MLA and DSA was the best conceptual intro I found on the topic and progressively builds up the ideas from first principles.

The Welch Labs analysis of MLA is a great watch as well. Beautiful visualization of the changes DeepSeek made for MLA.

Tried out a few kernels once I had a basic understanding of MLA and I think I'm slowly getting more comfortable with at least analyzing kernels.

levidiamode's tweet photo. Day 83/365 of GPU Programming

Looking at DeepSeek's Multi-Head Latent Attention today. The last part of the AMD challenge series is to optimize an MLA decode kernel for MI355X where the absorbed Q and compressed KV cache are given and your task is to do the attention computation.

A resource that really helped internalize what MLA does was @rasbt's incredible visual guide to attention variants in LLMs (luckily he posted that last week!), which covers everything from MHA to GQA to MLA to SWA, et cetera. If there's one place to get a visual intuition for recent attention mechanisms, it's this blog post.

@jbhuang0604's video on MQA, GQA,MLA and DSA was the best conceptual intro I found on the topic and progressively builds up the ideas from first principles.

The Welch Labs analysis of MLA is a great watch as well. Beautiful visualization of the changes DeepSeek made for MLA.

Tried out a few kernels once I had a basic understanding of MLA and I think I'm slowly getting more comfortable with at least analyzing kernels.

levidiamode's tweet photo. Day 83/365 of GPU Programming

Looking at DeepSeek's Multi-Head Latent Attention today. The last part of the AMD challenge series is to optimize an MLA decode kernel for MI355X where the absorbed Q and compressed KV cache are given and your task is to do the attention computation.

A resource that really helped internalize what MLA does was @rasbt's incredible visual guide to attention variants in LLMs (luckily he posted that last week!), which covers everything from MHA to GQA to MLA to SWA, et cetera. If there's one place to get a visual intuition for recent attention mechanisms, it's this blog post.

@jbhuang0604's video on MQA, GQA,MLA and DSA was the best conceptual intro I found on the topic and progressively builds up the ideas from first principles.

The Welch Labs analysis of MLA is a great watch as well. Beautiful visualization of the changes DeepSeek made for MLA.

Tried out a few kernels once I had a basic understanding of MLA and I think I'm slowly getting more comfortable with at least analyzing kernels.

2 months ago

Day 82/365 of GPU Programming Taking a closer look at Mixture of Experts today, so I can write better MoE kernels. Specifically, to optimize an MXFP4 MoE fused kernel for the GPU Mode challenge. I haven't had much prior exposure to MoEs, so lots of new concepts I learned today. Luckily I found the best intro to MoEs thanks to @MaartenGr visual overview of the topic. I then watched @tatsu_hashimoto's amazing Stanford CS336 lecture on MoEs, which added deeper context around why MoEs are gaining popularity, FLOPs, OLMoE, infra complexity, routing functions (mindblown this works so well...), expert sizes, training objectives, top k routing and DeepSeek variations. Once I had a basic understanding I started playing around with the some AITER kernels but progress there is tbd. Also had a nice chat with @juscallmevyom (who was kind enough to reach out!) about the AMD kernels and the challenge of materialization overhead.

levidiamode's tweet photo. Day 82/365 of GPU Programming

Taking a closer look at Mixture of Experts today, so I can write better MoE kernels. Specifically, to optimize an MXFP4 MoE fused kernel for the GPU Mode challenge.

I haven't had much prior exposure to MoEs, so lots of new concepts I learned today. Luckily I found the best intro to MoEs thanks to @MaartenGr visual overview of the topic.

I then watched @tatsu_hashimoto's amazing Stanford CS336 lecture on MoEs, which added deeper context around why MoEs are gaining popularity, FLOPs, OLMoE, infra complexity, routing functions (mindblown this works so well...), expert sizes, training objectives, top k routing and DeepSeek variations.

Once I had a basic understanding I started playing around with the some AITER kernels but progress there is tbd.

Also had a nice chat with @juscallmevyom (who was kind enough to reach out!) about the AMD kernels and the challenge of materialization overhead.

levidiamode's tweet photo. Day 82/365 of GPU Programming

Taking a closer look at Mixture of Experts today, so I can write better MoE kernels. Specifically, to optimize an MXFP4 MoE fused kernel for the GPU Mode challenge.

I haven't had much prior exposure to MoEs, so lots of new concepts I learned today. Luckily I found the best intro to MoEs thanks to @MaartenGr visual overview of the topic.

I then watched @tatsu_hashimoto's amazing Stanford CS336 lecture on MoEs, which added deeper context around why MoEs are gaining popularity, FLOPs, OLMoE, infra complexity, routing functions (mindblown this works so well...), expert sizes, training objectives, top k routing and DeepSeek variations.

Once I had a basic understanding I started playing around with the some AITER kernels but progress there is tbd.

Also had a nice chat with @juscallmevyom (who was kind enough to reach out!) about the AMD kernels and the challenge of materialization overhead.

levidiamode's tweet photo. Day 82/365 of GPU Programming

Taking a closer look at Mixture of Experts today, so I can write better MoE kernels. Specifically, to optimize an MXFP4 MoE fused kernel for the GPU Mode challenge.

I haven't had much prior exposure to MoEs, so lots of new concepts I learned today. Luckily I found the best intro to MoEs thanks to @MaartenGr visual overview of the topic.

I then watched @tatsu_hashimoto's amazing Stanford CS336 lecture on MoEs, which added deeper context around why MoEs are gaining popularity, FLOPs, OLMoE, infra complexity, routing functions (mindblown this works so well...), expert sizes, training objectives, top k routing and DeepSeek variations.

Once I had a basic understanding I started playing around with the some AITER kernels but progress there is tbd.

Also had a nice chat with @juscallmevyom (who was kind enough to reach out!) about the AMD kernels and the challenge of materialization overhead.

levidiamode's tweet photo. Day 82/365 of GPU Programming

Taking a closer look at Mixture of Experts today, so I can write better MoE kernels. Specifically, to optimize an MXFP4 MoE fused kernel for the GPU Mode challenge.

I haven't had much prior exposure to MoEs, so lots of new concepts I learned today. Luckily I found the best intro to MoEs thanks to @MaartenGr visual overview of the topic.

I then watched @tatsu_hashimoto's amazing Stanford CS336 lecture on MoEs, which added deeper context around why MoEs are gaining popularity, FLOPs, OLMoE, infra complexity, routing functions (mindblown this works so well...), expert sizes, training objectives, top k routing and DeepSeek variations.

Once I had a basic understanding I started playing around with the some AITER kernels but progress there is tbd.

Also had a nice chat with @juscallmevyom (who was kind enough to reach out!) about the AMD kernels and the challenge of materialization overhead.

9

541

42

516

145K

21

1K

146

1K

117K

about 13 hours ago

151/365 of GPU Programming Finishing up edits and running a few followup experiments the next few days to see if I can extend the workshop idea to a more general setting that would be practically useful + viable for a full conference paper. Been also playing around with switching back and forth between smaller local experiments and scaling to larger GPUs via services like @modal and @runpod (also want to try @fal). Super convenient I have to say but can also get quite expensive quickly, so always need to double check whether a job I'm sending off is really worth it.

levidiamode's tweet photo. 151/365 of GPU Programming

Finishing up edits and running a few followup experiments the next few days to see if I can extend the workshop idea to a more general setting that would be practically useful + viable for a full conference paper.

Been also playing around with switching back and forth between smaller local experiments and scaling to larger GPUs via services like @modal and @runpod (also want to try @fal). Super convenient I have to say but can also get quite expensive quickly, so always need to double check whether a job I'm sending off is really worth it.

1 day ago

150/365 of GPU Programming Doing a review of my workshop paper today and going through a few edits for the camera ready version, so it's in good shape by the deadline. On that note, I really enjoyed @srush_nlp's talk on How to Write an okay Research Paper . A nice, pragmatic intro to ok paper writing that makes the most of the space you have available inside the page limit. Really recommend giving it a watch if you're new to research like me.

levidiamode's tweet photo. 150/365 of GPU Programming

Doing a review of my workshop paper today and going through a few edits for the camera ready version, so it's in good shape by the deadline.

On that note, I really enjoyed @srush_nlp's talk on How to Write an okay Research Paper . A nice, pragmatic intro to ok paper writing that makes the most of the space you have available inside the page limit.

Really recommend giving it a watch if you're new to research like me.

levidiamode's tweet photo. 150/365 of GPU Programming

Doing a review of my workshop paper today and going through a few edits for the camera ready version, so it's in good shape by the deadline.

On that note, I really enjoyed @srush_nlp's talk on How to Write an okay Research Paper . A nice, pragmatic intro to ok paper writing that makes the most of the space you have available inside the page limit.

Really recommend giving it a watch if you're new to research like me.

levidiamode's tweet photo. 150/365 of GPU Programming

Doing a review of my workshop paper today and going through a few edits for the camera ready version, so it's in good shape by the deadline.

On that note, I really enjoyed @srush_nlp's talk on How to Write an okay Research Paper . A nice, pragmatic intro to ok paper writing that makes the most of the space you have available inside the page limit.

Really recommend giving it a watch if you're new to research like me.

levidiamode's tweet photo. 150/365 of GPU Programming

Doing a review of my workshop paper today and going through a few edits for the camera ready version, so it's in good shape by the deadline.

On that note, I really enjoyed @srush_nlp's talk on How to Write an okay Research Paper . A nice, pragmatic intro to ok paper writing that makes the most of the space you have available inside the page limit.

Really recommend giving it a watch if you're new to research like me.

2

97

3

60

4K

1

33

1

12

1K

1 day ago

@srush_nlp - Link to talk: https://t.co/0HTnTM4wlj - Link to slides: https://t.co/KgLJulfyMn

0

7

1

3

416

1 day ago

150/365 of GPU Programming Doing a review of my workshop paper today and going through a few edits for the camera ready version, so it's in good shape by the deadline. On that note, I really enjoyed @srush_nlp's talk on How to Write an okay Research Paper . A nice, pragmatic intro to ok paper writing that makes the most of the space you have available inside the page limit. Really recommend giving it a watch if you're new to research like me.

levidiamode's tweet photo. 150/365 of GPU Programming

Doing a review of my workshop paper today and going through a few edits for the camera ready version, so it's in good shape by the deadline.

On that note, I really enjoyed @srush_nlp's talk on How to Write an okay Research Paper . A nice, pragmatic intro to ok paper writing that makes the most of the space you have available inside the page limit.

Really recommend giving it a watch if you're new to research like me.

levidiamode's tweet photo. 150/365 of GPU Programming

Doing a review of my workshop paper today and going through a few edits for the camera ready version, so it's in good shape by the deadline.

On that note, I really enjoyed @srush_nlp's talk on How to Write an okay Research Paper . A nice, pragmatic intro to ok paper writing that makes the most of the space you have available inside the page limit.

Really recommend giving it a watch if you're new to research like me.

levidiamode's tweet photo. 150/365 of GPU Programming

Doing a review of my workshop paper today and going through a few edits for the camera ready version, so it's in good shape by the deadline.

On that note, I really enjoyed @srush_nlp's talk on How to Write an okay Research Paper . A nice, pragmatic intro to ok paper writing that makes the most of the space you have available inside the page limit.

Really recommend giving it a watch if you're new to research like me.

levidiamode's tweet photo. 150/365 of GPU Programming

Doing a review of my workshop paper today and going through a few edits for the camera ready version, so it's in good shape by the deadline.

On that note, I really enjoyed @srush_nlp's talk on How to Write an okay Research Paper . A nice, pragmatic intro to ok paper writing that makes the most of the space you have available inside the page limit.

Really recommend giving it a watch if you're new to research like me.

3 days ago

149/365 of GPU Programming Been seeing a lot of people talking DFlash recently, so wanted to spend time today learning more about what it is. Luckily stumbled across Eleuther's ML Performance Reading group (organized by @vega_myhre), which had a nice introduction to the topic. While reading the paper, I'm also trying to see how I can pair DFlash with the local inference optimizations I'm experimenting with recently.

levidiamode's tweet photo. 149/365 of GPU Programming

Been seeing a lot of people talking DFlash recently, so wanted to spend time today learning more about what it is.

Luckily stumbled across Eleuther's ML Performance Reading group (organized by @vega_myhre), which had a nice introduction to the topic.

While reading the paper, I'm also trying to see how I can pair DFlash with the local inference optimizations I'm experimenting with recently.

levidiamode's tweet photo. 149/365 of GPU Programming

Been seeing a lot of people talking DFlash recently, so wanted to spend time today learning more about what it is.

Luckily stumbled across Eleuther's ML Performance Reading group (organized by @vega_myhre), which had a nice introduction to the topic.

While reading the paper, I'm also trying to see how I can pair DFlash with the local inference optimizations I'm experimenting with recently.

levidiamode's tweet photo. 149/365 of GPU Programming

Been seeing a lot of people talking DFlash recently, so wanted to spend time today learning more about what it is.

Luckily stumbled across Eleuther's ML Performance Reading group (organized by @vega_myhre), which had a nice introduction to the topic.

While reading the paper, I'm also trying to see how I can pair DFlash with the local inference optimizations I'm experimenting with recently.

levidiamode's tweet photo. 149/365 of GPU Programming

Been seeing a lot of people talking DFlash recently, so wanted to spend time today learning more about what it is.

Luckily stumbled across Eleuther's ML Performance Reading group (organized by @vega_myhre), which had a nice introduction to the topic.

While reading the paper, I'm also trying to see how I can pair DFlash with the local inference optimizations I'm experimenting with recently.

3

94

3

59

7K

2

97

3

60

4K

2 days ago

@guojing0 @vega_myhre both luckily! recently got a used nvidia setup locally and use a bit of cloud here and there when needed

1

1

0

0

35

3 days ago

149/365 of GPU Programming Been seeing a lot of people talking DFlash recently, so wanted to spend time today learning more about what it is. Luckily stumbled across Eleuther's ML Performance Reading group (organized by @vega_myhre), which had a nice introduction to the topic. While reading the paper, I'm also trying to see how I can pair DFlash with the local inference optimizations I'm experimenting with recently.

levidiamode's tweet photo. 149/365 of GPU Programming

Been seeing a lot of people talking DFlash recently, so wanted to spend time today learning more about what it is.

Luckily stumbled across Eleuther's ML Performance Reading group (organized by @vega_myhre), which had a nice introduction to the topic.

While reading the paper, I'm also trying to see how I can pair DFlash with the local inference optimizations I'm experimenting with recently.

levidiamode's tweet photo. 149/365 of GPU Programming

Been seeing a lot of people talking DFlash recently, so wanted to spend time today learning more about what it is.

Luckily stumbled across Eleuther's ML Performance Reading group (organized by @vega_myhre), which had a nice introduction to the topic.

While reading the paper, I'm also trying to see how I can pair DFlash with the local inference optimizations I'm experimenting with recently.

levidiamode's tweet photo. 149/365 of GPU Programming

Been seeing a lot of people talking DFlash recently, so wanted to spend time today learning more about what it is.

Luckily stumbled across Eleuther's ML Performance Reading group (organized by @vega_myhre), which had a nice introduction to the topic.

While reading the paper, I'm also trying to see how I can pair DFlash with the local inference optimizations I'm experimenting with recently.

levidiamode's tweet photo. 149/365 of GPU Programming

Been seeing a lot of people talking DFlash recently, so wanted to spend time today learning more about what it is.

Luckily stumbled across Eleuther's ML Performance Reading group (organized by @vega_myhre), which had a nice introduction to the topic.

While reading the paper, I'm also trying to see how I can pair DFlash with the local inference optimizations I'm experimenting with recently.

4 days ago

148/365 of GPU Programming Going through the scaling law lecture of CS336 today and I find it more dense than previous materials. Will need to spend some additional time over the next few days reviewing the resources from the slides and attached papers to get more comfortable with the topic. I'm starting to understand what scaling laws are and why they're useful w.r.t compute/data but still trying to get how critical batch sizes, muP, model-data joint errors, etc are derived, the primary differences between Kaplan vs Chinchilla and train-optimal vs inference-optimal. One nice thing about this particular lecture is that it doesn't just cover the recent scaling law papers but also goes back to the early history of scaling laws like Cortes et al 1991, Branko and Brill '01, and so on.

levidiamode's tweet photo. 148/365 of GPU Programming

Going through the scaling law lecture of CS336 today and I find it more dense than previous materials. Will need to spend some additional time over the next few days reviewing the resources from the slides and attached papers to get more comfortable with the topic.

I'm starting to understand what scaling laws are and why they're useful w.r.t compute/data but still trying to get how critical batch sizes, muP, model-data joint errors, etc are derived, the primary differences between Kaplan vs Chinchilla and train-optimal vs inference-optimal.

One nice thing about this particular lecture is that it doesn't just cover the recent scaling law papers but also goes back to the early history of scaling laws like Cortes et al 1991, Branko and Brill '01, and so on.

levidiamode's tweet photo. 148/365 of GPU Programming

Going through the scaling law lecture of CS336 today and I find it more dense than previous materials. Will need to spend some additional time over the next few days reviewing the resources from the slides and attached papers to get more comfortable with the topic.

I'm starting to understand what scaling laws are and why they're useful w.r.t compute/data but still trying to get how critical batch sizes, muP, model-data joint errors, etc are derived, the primary differences between Kaplan vs Chinchilla and train-optimal vs inference-optimal.

One nice thing about this particular lecture is that it doesn't just cover the recent scaling law papers but also goes back to the early history of scaling laws like Cortes et al 1991, Branko and Brill '01, and so on.

levidiamode's tweet photo. 148/365 of GPU Programming

Going through the scaling law lecture of CS336 today and I find it more dense than previous materials. Will need to spend some additional time over the next few days reviewing the resources from the slides and attached papers to get more comfortable with the topic.

I'm starting to understand what scaling laws are and why they're useful w.r.t compute/data but still trying to get how critical batch sizes, muP, model-data joint errors, etc are derived, the primary differences between Kaplan vs Chinchilla and train-optimal vs inference-optimal.

One nice thing about this particular lecture is that it doesn't just cover the recent scaling law papers but also goes back to the early history of scaling laws like Cortes et al 1991, Branko and Brill '01, and so on.

levidiamode's tweet photo. 148/365 of GPU Programming

Going through the scaling law lecture of CS336 today and I find it more dense than previous materials. Will need to spend some additional time over the next few days reviewing the resources from the slides and attached papers to get more comfortable with the topic.

I'm starting to understand what scaling laws are and why they're useful w.r.t compute/data but still trying to get how critical batch sizes, muP, model-data joint errors, etc are derived, the primary differences between Kaplan vs Chinchilla and train-optimal vs inference-optimal.

One nice thing about this particular lecture is that it doesn't just cover the recent scaling law papers but also goes back to the early history of scaling laws like Cortes et al 1991, Branko and Brill '01, and so on.

1

56

6

32

9K

3

94

3

59

7K

levidiamode retweeted

Edward Z. Yang @ezyang

3 days ago

New devlog post from yours truly: When does fragmentation occur in the CUDA caching allocator? https://t.co/ocAdv4mjy2 -- this post is LLM authored but I heavily prompted/edited, and Natalia also helped fact check.

8

136

14

85

11K

3 days ago

@vega_myhre - Link to ML performance reading group recoding: https://t.co/Mazl2ssoKw - Link to paper: https://t.co/fy2BqXHajw - Link to repo: https://t.co/rW5ndUtlmT - Link to blog: https://t.co/sIVvDHn9Cy

0

4

2

9

699

4 days ago

148/365 of GPU Programming Going through the scaling law lecture of CS336 today and I find it more dense than previous materials. Will need to spend some additional time over the next few days reviewing the resources from the slides and attached papers to get more comfortable with the topic. I'm starting to understand what scaling laws are and why they're useful w.r.t compute/data but still trying to get how critical batch sizes, muP, model-data joint errors, etc are derived, the primary differences between Kaplan vs Chinchilla and train-optimal vs inference-optimal. One nice thing about this particular lecture is that it doesn't just cover the recent scaling law papers but also goes back to the early history of scaling laws like Cortes et al 1991, Branko and Brill '01, and so on.

levidiamode's tweet photo. 148/365 of GPU Programming

Going through the scaling law lecture of CS336 today and I find it more dense than previous materials. Will need to spend some additional time over the next few days reviewing the resources from the slides and attached papers to get more comfortable with the topic.

I'm starting to understand what scaling laws are and why they're useful w.r.t compute/data but still trying to get how critical batch sizes, muP, model-data joint errors, etc are derived, the primary differences between Kaplan vs Chinchilla and train-optimal vs inference-optimal.

One nice thing about this particular lecture is that it doesn't just cover the recent scaling law papers but also goes back to the early history of scaling laws like Cortes et al 1991, Branko and Brill '01, and so on.

levidiamode's tweet photo. 148/365 of GPU Programming

Going through the scaling law lecture of CS336 today and I find it more dense than previous materials. Will need to spend some additional time over the next few days reviewing the resources from the slides and attached papers to get more comfortable with the topic.

I'm starting to understand what scaling laws are and why they're useful w.r.t compute/data but still trying to get how critical batch sizes, muP, model-data joint errors, etc are derived, the primary differences between Kaplan vs Chinchilla and train-optimal vs inference-optimal.

One nice thing about this particular lecture is that it doesn't just cover the recent scaling law papers but also goes back to the early history of scaling laws like Cortes et al 1991, Branko and Brill '01, and so on.

levidiamode's tweet photo. 148/365 of GPU Programming

Going through the scaling law lecture of CS336 today and I find it more dense than previous materials. Will need to spend some additional time over the next few days reviewing the resources from the slides and attached papers to get more comfortable with the topic.

I'm starting to understand what scaling laws are and why they're useful w.r.t compute/data but still trying to get how critical batch sizes, muP, model-data joint errors, etc are derived, the primary differences between Kaplan vs Chinchilla and train-optimal vs inference-optimal.

One nice thing about this particular lecture is that it doesn't just cover the recent scaling law papers but also goes back to the early history of scaling laws like Cortes et al 1991, Branko and Brill '01, and so on.

levidiamode's tweet photo. 148/365 of GPU Programming

Going through the scaling law lecture of CS336 today and I find it more dense than previous materials. Will need to spend some additional time over the next few days reviewing the resources from the slides and attached papers to get more comfortable with the topic.

I'm starting to understand what scaling laws are and why they're useful w.r.t compute/data but still trying to get how critical batch sizes, muP, model-data joint errors, etc are derived, the primary differences between Kaplan vs Chinchilla and train-optimal vs inference-optimal.

One nice thing about this particular lecture is that it doesn't just cover the recent scaling law papers but also goes back to the early history of scaling laws like Cortes et al 1991, Branko and Brill '01, and so on.

5 days ago

147/365 of GPU Programming Iterating on my auto inference dashboard and adding a few functionalities that have helped me get a better intuition for what is being tested and why. Added a knowledge graph that connects experiments with learnings, a live queue that reorders based on incoming data and provides context around future experiments and active hypotheses with summaries on what is currently being validated vs ruled out. From there, I find it much easier to dive deeper into specific hypotheses/experiments and helps not getting lost w.r.t where the system is and what it's converging on.

levidiamode's tweet photo. 147/365 of GPU Programming

Iterating on my auto inference dashboard and adding a few functionalities that have helped me get a better intuition for what is being tested and why.

Added a knowledge graph that connects experiments with learnings, a live queue that reorders based on incoming data and provides context around future experiments and active hypotheses with summaries on what is currently being validated vs ruled out.

From there, I find it much easier to dive deeper into specific hypotheses/experiments and helps not getting lost w.r.t where the system is and what it's converging on.

levidiamode's tweet photo. 147/365 of GPU Programming

Iterating on my auto inference dashboard and adding a few functionalities that have helped me get a better intuition for what is being tested and why.

Added a knowledge graph that connects experiments with learnings, a live queue that reorders based on incoming data and provides context around future experiments and active hypotheses with summaries on what is currently being validated vs ruled out.

From there, I find it much easier to dive deeper into specific hypotheses/experiments and helps not getting lost w.r.t where the system is and what it's converging on.

levidiamode's tweet photo. 147/365 of GPU Programming

Iterating on my auto inference dashboard and adding a few functionalities that have helped me get a better intuition for what is being tested and why.

Added a knowledge graph that connects experiments with learnings, a live queue that reorders based on incoming data and provides context around future experiments and active hypotheses with summaries on what is currently being validated vs ruled out.

From there, I find it much easier to dive deeper into specific hypotheses/experiments and helps not getting lost w.r.t where the system is and what it's converging on.

1

60

3

25

8K

1

56

6

32

9K

5 days ago

@seylorra auto pulled

0

1

0

0

45

5 days ago

147/365 of GPU Programming Iterating on my auto inference dashboard and adding a few functionalities that have helped me get a better intuition for what is being tested and why. Added a knowledge graph that connects experiments with learnings, a live queue that reorders based on incoming data and provides context around future experiments and active hypotheses with summaries on what is currently being validated vs ruled out. From there, I find it much easier to dive deeper into specific hypotheses/experiments and helps not getting lost w.r.t where the system is and what it's converging on.

levidiamode's tweet photo. 147/365 of GPU Programming

Iterating on my auto inference dashboard and adding a few functionalities that have helped me get a better intuition for what is being tested and why.

Added a knowledge graph that connects experiments with learnings, a live queue that reorders based on incoming data and provides context around future experiments and active hypotheses with summaries on what is currently being validated vs ruled out.

From there, I find it much easier to dive deeper into specific hypotheses/experiments and helps not getting lost w.r.t where the system is and what it's converging on.

levidiamode's tweet photo. 147/365 of GPU Programming

Iterating on my auto inference dashboard and adding a few functionalities that have helped me get a better intuition for what is being tested and why.

Added a knowledge graph that connects experiments with learnings, a live queue that reorders based on incoming data and provides context around future experiments and active hypotheses with summaries on what is currently being validated vs ruled out.

From there, I find it much easier to dive deeper into specific hypotheses/experiments and helps not getting lost w.r.t where the system is and what it's converging on.

levidiamode's tweet photo. 147/365 of GPU Programming

Iterating on my auto inference dashboard and adding a few functionalities that have helped me get a better intuition for what is being tested and why.

Added a knowledge graph that connects experiments with learnings, a live queue that reorders based on incoming data and provides context around future experiments and active hypotheses with summaries on what is currently being validated vs ruled out.

From there, I find it much easier to dive deeper into specific hypotheses/experiments and helps not getting lost w.r.t where the system is and what it's converging on.

6 days ago

146/365 of GPU Programming Happy Friday! Excited to run some experiments today and catch up on CS336 over the weekend. Speaking of, lecture 8 is so so good. Such a nice continuation of lecture 7. Those two are probably my favorite lectures I've watched this year. By far the best resource on parallelism I've come across so far. Lecture 7 setting the foundations from first principles and lecture 8 diving into the details of ZeRO stages 1/2/3, zero bubble pipelining, tensor parallelism, expert parallelism, context parallelism and how they're applied in Nemotron 3, DeepSeek V3, Llama 3, Gemma 2, Mixtra 8x22b, Qwen 3, etc I have such a deep admiration for Percy's and Tatsu's teaching style, I really hope they continue to share their course materially publicly.

levidiamode's tweet photo. 146/365 of GPU Programming

Happy Friday! Excited to run some experiments today and catch up on CS336 over the weekend.

Speaking of, lecture 8 is so so good. Such a nice continuation of lecture 7. Those two are probably my favorite lectures I've watched this year.

By far the best resource on parallelism I've come across so far. Lecture 7 setting the foundations from first principles and lecture 8 diving into the details of ZeRO stages 1/2/3, zero bubble pipelining, tensor parallelism, expert parallelism, context parallelism and how they're applied in Nemotron 3, DeepSeek V3, Llama 3, Gemma 2, Mixtra 8x22b, Qwen 3, etc

I have such a deep admiration for Percy's and Tatsu's teaching style, I really hope they continue to share their course materially publicly.

levidiamode's tweet photo. 146/365 of GPU Programming

Happy Friday! Excited to run some experiments today and catch up on CS336 over the weekend.

Speaking of, lecture 8 is so so good. Such a nice continuation of lecture 7. Those two are probably my favorite lectures I've watched this year.

By far the best resource on parallelism I've come across so far. Lecture 7 setting the foundations from first principles and lecture 8 diving into the details of ZeRO stages 1/2/3, zero bubble pipelining, tensor parallelism, expert parallelism, context parallelism and how they're applied in Nemotron 3, DeepSeek V3, Llama 3, Gemma 2, Mixtra 8x22b, Qwen 3, etc

I have such a deep admiration for Percy's and Tatsu's teaching style, I really hope they continue to share their course materially publicly.

levidiamode's tweet photo. 146/365 of GPU Programming

Happy Friday! Excited to run some experiments today and catch up on CS336 over the weekend.

Speaking of, lecture 8 is so so good. Such a nice continuation of lecture 7. Those two are probably my favorite lectures I've watched this year.

By far the best resource on parallelism I've come across so far. Lecture 7 setting the foundations from first principles and lecture 8 diving into the details of ZeRO stages 1/2/3, zero bubble pipelining, tensor parallelism, expert parallelism, context parallelism and how they're applied in Nemotron 3, DeepSeek V3, Llama 3, Gemma 2, Mixtra 8x22b, Qwen 3, etc

I have such a deep admiration for Percy's and Tatsu's teaching style, I really hope they continue to share their course materially publicly.

levidiamode's tweet photo. 146/365 of GPU Programming

Happy Friday! Excited to run some experiments today and catch up on CS336 over the weekend.

Speaking of, lecture 8 is so so good. Such a nice continuation of lecture 7. Those two are probably my favorite lectures I've watched this year.

By far the best resource on parallelism I've come across so far. Lecture 7 setting the foundations from first principles and lecture 8 diving into the details of ZeRO stages 1/2/3, zero bubble pipelining, tensor parallelism, expert parallelism, context parallelism and how they're applied in Nemotron 3, DeepSeek V3, Llama 3, Gemma 2, Mixtra 8x22b, Qwen 3, etc

I have such a deep admiration for Percy's and Tatsu's teaching style, I really hope they continue to share their course materially publicly.

3

95

7

75

11K

1

60

3

25

8K

6 days ago

Link to lecture 8 https://t.co/lGsTwTBMyz

0

4

0

3

346

6 days ago

146/365 of GPU Programming Happy Friday! Excited to run some experiments today and catch up on CS336 over the weekend. Speaking of, lecture 8 is so so good. Such a nice continuation of lecture 7. Those two are probably my favorite lectures I've watched this year. By far the best resource on parallelism I've come across so far. Lecture 7 setting the foundations from first principles and lecture 8 diving into the details of ZeRO stages 1/2/3, zero bubble pipelining, tensor parallelism, expert parallelism, context parallelism and how they're applied in Nemotron 3, DeepSeek V3, Llama 3, Gemma 2, Mixtra 8x22b, Qwen 3, etc I have such a deep admiration for Percy's and Tatsu's teaching style, I really hope they continue to share their course materially publicly.

levidiamode's tweet photo. 146/365 of GPU Programming

Happy Friday! Excited to run some experiments today and catch up on CS336 over the weekend.

Speaking of, lecture 8 is so so good. Such a nice continuation of lecture 7. Those two are probably my favorite lectures I've watched this year.

By far the best resource on parallelism I've come across so far. Lecture 7 setting the foundations from first principles and lecture 8 diving into the details of ZeRO stages 1/2/3, zero bubble pipelining, tensor parallelism, expert parallelism, context parallelism and how they're applied in Nemotron 3, DeepSeek V3, Llama 3, Gemma 2, Mixtra 8x22b, Qwen 3, etc

I have such a deep admiration for Percy's and Tatsu's teaching style, I really hope they continue to share their course materially publicly.

levidiamode's tweet photo. 146/365 of GPU Programming

Happy Friday! Excited to run some experiments today and catch up on CS336 over the weekend.

Speaking of, lecture 8 is so so good. Such a nice continuation of lecture 7. Those two are probably my favorite lectures I've watched this year.

By far the best resource on parallelism I've come across so far. Lecture 7 setting the foundations from first principles and lecture 8 diving into the details of ZeRO stages 1/2/3, zero bubble pipelining, tensor parallelism, expert parallelism, context parallelism and how they're applied in Nemotron 3, DeepSeek V3, Llama 3, Gemma 2, Mixtra 8x22b, Qwen 3, etc

I have such a deep admiration for Percy's and Tatsu's teaching style, I really hope they continue to share their course materially publicly.

levidiamode's tweet photo. 146/365 of GPU Programming

Happy Friday! Excited to run some experiments today and catch up on CS336 over the weekend.

Speaking of, lecture 8 is so so good. Such a nice continuation of lecture 7. Those two are probably my favorite lectures I've watched this year.

By far the best resource on parallelism I've come across so far. Lecture 7 setting the foundations from first principles and lecture 8 diving into the details of ZeRO stages 1/2/3, zero bubble pipelining, tensor parallelism, expert parallelism, context parallelism and how they're applied in Nemotron 3, DeepSeek V3, Llama 3, Gemma 2, Mixtra 8x22b, Qwen 3, etc

I have such a deep admiration for Percy's and Tatsu's teaching style, I really hope they continue to share their course materially publicly.

levidiamode's tweet photo. 146/365 of GPU Programming

Happy Friday! Excited to run some experiments today and catch up on CS336 over the weekend.

Speaking of, lecture 8 is so so good. Such a nice continuation of lecture 7. Those two are probably my favorite lectures I've watched this year.

By far the best resource on parallelism I've come across so far. Lecture 7 setting the foundations from first principles and lecture 8 diving into the details of ZeRO stages 1/2/3, zero bubble pipelining, tensor parallelism, expert parallelism, context parallelism and how they're applied in Nemotron 3, DeepSeek V3, Llama 3, Gemma 2, Mixtra 8x22b, Qwen 3, etc

I have such a deep admiration for Percy's and Tatsu's teaching style, I really hope they continue to share their course materially publicly.

7 days ago

145/365 of GPU Programming A few additional resources that are helping me get a better sense of ML research as a process and what "taste" might mean when people talk about problem selection: -https://t.co/hV19P3dEtX by Bill Freeman - https://t.co/s6rRrJLcTd by @tomssilver - https://t.co/dmvozUfqgg by @johnschulman2 - https://t.co/FONXA8Gq1f by @MarekRei - https://t.co/sGFfKcg3Sl by @seb_ruder - https://t.co/hzzGBiGkF4 by @LouisKirschAI - https://t.co/Gc9GC3R8yd by @ericjang11 - https://t.co/X3FT1gs5Sw by @ch402 Some of these go back over a decade but I think are just as relevant today. I can't say I have any research taste yet or know what good ML research is yet but these blog posts have been really helpful in understanding the perspective of how great researchers select what to work on a bit better.

levidiamode's tweet photo. 145/365 of GPU Programming

A few additional resources that are helping me get a better sense of ML research as a process and what "taste" might mean when people talk about problem selection:
-https://t.co/hV19P3dEtX by Bill Freeman
- https://t.co/s6rRrJLcTd by @tomssilver
- https://t.co/dmvozUfqgg by @johnschulman2
- https://t.co/FONXA8Gq1f by @MarekRei
- https://t.co/sGFfKcg3Sl by @seb_ruder
- https://t.co/hzzGBiGkF4 by @LouisKirschAI
- https://t.co/Gc9GC3R8yd by @ericjang11
- https://t.co/X3FT1gs5Sw by @ch402

Some of these go back over a decade but I think are just as relevant today. I can't say I have any research taste yet or know what good ML research is yet but these blog posts have been really helpful in understanding the perspective of how great researchers select what to work on a bit better.

levidiamode's tweet photo. 145/365 of GPU Programming

A few additional resources that are helping me get a better sense of ML research as a process and what "taste" might mean when people talk about problem selection:
-https://t.co/hV19P3dEtX by Bill Freeman
- https://t.co/s6rRrJLcTd by @tomssilver
- https://t.co/dmvozUfqgg by @johnschulman2
- https://t.co/FONXA8Gq1f by @MarekRei
- https://t.co/sGFfKcg3Sl by @seb_ruder
- https://t.co/hzzGBiGkF4 by @LouisKirschAI
- https://t.co/Gc9GC3R8yd by @ericjang11
- https://t.co/X3FT1gs5Sw by @ch402

Some of these go back over a decade but I think are just as relevant today. I can't say I have any research taste yet or know what good ML research is yet but these blog posts have been really helpful in understanding the perspective of how great researchers select what to work on a bit better.

levidiamode's tweet photo. 145/365 of GPU Programming

A few additional resources that are helping me get a better sense of ML research as a process and what "taste" might mean when people talk about problem selection:
-https://t.co/hV19P3dEtX by Bill Freeman
- https://t.co/s6rRrJLcTd by @tomssilver
- https://t.co/dmvozUfqgg by @johnschulman2
- https://t.co/FONXA8Gq1f by @MarekRei
- https://t.co/sGFfKcg3Sl by @seb_ruder
- https://t.co/hzzGBiGkF4 by @LouisKirschAI
- https://t.co/Gc9GC3R8yd by @ericjang11
- https://t.co/X3FT1gs5Sw by @ch402

Some of these go back over a decade but I think are just as relevant today. I can't say I have any research taste yet or know what good ML research is yet but these blog posts have been really helpful in understanding the perspective of how great researchers select what to work on a bit better.

0

48

6

24

10K

3

95

7

75

11K

7 days ago

145/365 of GPU Programming A few additional resources that are helping me get a better sense of ML research as a process and what "taste" might mean when people talk about problem selection: -https://t.co/hV19P3dEtX by Bill Freeman - https://t.co/s6rRrJLcTd by @tomssilver - https://t.co/dmvozUfqgg by @johnschulman2 - https://t.co/FONXA8Gq1f by @MarekRei - https://t.co/sGFfKcg3Sl by @seb_ruder - https://t.co/hzzGBiGkF4 by @LouisKirschAI - https://t.co/Gc9GC3R8yd by @ericjang11 - https://t.co/X3FT1gs5Sw by @ch402 Some of these go back over a decade but I think are just as relevant today. I can't say I have any research taste yet or know what good ML research is yet but these blog posts have been really helpful in understanding the perspective of how great researchers select what to work on a bit better.

levidiamode's tweet photo. 145/365 of GPU Programming

A few additional resources that are helping me get a better sense of ML research as a process and what "taste" might mean when people talk about problem selection:
-https://t.co/hV19P3dEtX by Bill Freeman
- https://t.co/s6rRrJLcTd by @tomssilver
- https://t.co/dmvozUfqgg by @johnschulman2
- https://t.co/FONXA8Gq1f by @MarekRei
- https://t.co/sGFfKcg3Sl by @seb_ruder
- https://t.co/hzzGBiGkF4 by @LouisKirschAI
- https://t.co/Gc9GC3R8yd by @ericjang11
- https://t.co/X3FT1gs5Sw by @ch402

Some of these go back over a decade but I think are just as relevant today. I can't say I have any research taste yet or know what good ML research is yet but these blog posts have been really helpful in understanding the perspective of how great researchers select what to work on a bit better.

levidiamode's tweet photo. 145/365 of GPU Programming

A few additional resources that are helping me get a better sense of ML research as a process and what "taste" might mean when people talk about problem selection:
-https://t.co/hV19P3dEtX by Bill Freeman
- https://t.co/s6rRrJLcTd by @tomssilver
- https://t.co/dmvozUfqgg by @johnschulman2
- https://t.co/FONXA8Gq1f by @MarekRei
- https://t.co/sGFfKcg3Sl by @seb_ruder
- https://t.co/hzzGBiGkF4 by @LouisKirschAI
- https://t.co/Gc9GC3R8yd by @ericjang11
- https://t.co/X3FT1gs5Sw by @ch402

Some of these go back over a decade but I think are just as relevant today. I can't say I have any research taste yet or know what good ML research is yet but these blog posts have been really helpful in understanding the perspective of how great researchers select what to work on a bit better.

levidiamode's tweet photo. 145/365 of GPU Programming

A few additional resources that are helping me get a better sense of ML research as a process and what "taste" might mean when people talk about problem selection:
-https://t.co/hV19P3dEtX by Bill Freeman
- https://t.co/s6rRrJLcTd by @tomssilver
- https://t.co/dmvozUfqgg by @johnschulman2
- https://t.co/FONXA8Gq1f by @MarekRei
- https://t.co/sGFfKcg3Sl by @seb_ruder
- https://t.co/hzzGBiGkF4 by @LouisKirschAI
- https://t.co/Gc9GC3R8yd by @ericjang11
- https://t.co/X3FT1gs5Sw by @ch402

Some of these go back over a decade but I think are just as relevant today. I can't say I have any research taste yet or know what good ML research is yet but these blog posts have been really helpful in understanding the perspective of how great researchers select what to work on a bit better.

8 days ago

144/365 of GPU Programming Trying to learn more about what good ML research looks like and how it's conducted. One of the most helpful things so far has been online "shadowing" actual researchers. Highly recommend watching @NeelNanda5 vibe research for a couple hours with Cursor or tuning into one of @jsuarez 5h livestreams. Even though neither RL nor MechInterp are areas I'm studying, it's been super helpful seeing what their process is like and how they think about problems. I wish more researchers did livestreams like that. Especially in times where more and more of the process seems to be automated and it becomes a question of problem selection, steering and verification.

levidiamode's tweet photo. 144/365 of GPU Programming

Trying to learn more about what good ML research looks like and how it's conducted.

One of the most helpful things so far has been online "shadowing" actual researchers.

Highly recommend watching @NeelNanda5 vibe research for a couple hours with Cursor or tuning into one of @jsuarez 5h livestreams. Even though neither RL nor MechInterp are areas I'm studying, it's been super helpful seeing what their process is like and how they think about problems.

I wish more researchers did livestreams like that. Especially in times where more and more of the process seems to be automated and it becomes a question of problem selection, steering and verification.

2

55

1

33

4K

0

48

6

24

10K

8 days ago

@yash1_ @NeelNanda5 and maybe an even better educator

0

1

0

0

64

8 days ago

144/365 of GPU Programming Trying to learn more about what good ML research looks like and how it's conducted. One of the most helpful things so far has been online "shadowing" actual researchers. Highly recommend watching @NeelNanda5 vibe research for a couple hours with Cursor or tuning into one of @jsuarez 5h livestreams. Even though neither RL nor MechInterp are areas I'm studying, it's been super helpful seeing what their process is like and how they think about problems. I wish more researchers did livestreams like that. Especially in times where more and more of the process seems to be automated and it becomes a question of problem selection, steering and verification.

levidiamode's tweet photo. 144/365 of GPU Programming

Trying to learn more about what good ML research looks like and how it's conducted.

One of the most helpful things so far has been online "shadowing" actual researchers.

Highly recommend watching @NeelNanda5 vibe research for a couple hours with Cursor or tuning into one of @jsuarez 5h livestreams. Even though neither RL nor MechInterp are areas I'm studying, it's been super helpful seeing what their process is like and how they think about problems.

I wish more researchers did livestreams like that. Especially in times where more and more of the process seems to be automated and it becomes a question of problem selection, steering and verification.

9 days ago

143/365 of GPU Programming Working on my GPU agent dashboard setup today. A few features that have been really helpful so far are a live view of GPUs and their states/utilizations, a quick overview section of current/future/past experiments with easy access to summary stats + descriptions of what was run alongside hypotheses and rationale for these experiments, a running log of things happening (whether queueing experiments or coding work). Definitely still quite bare bones but already been quite helpful in the way I approach this whole workflow and what I learn from it along the way rather than just being a passive observant without insight into what's actually being tested and run.

levidiamode's tweet photo. 143/365 of GPU Programming

Working on my GPU agent dashboard setup today.

A few features that have been really helpful so far are a live view of GPUs and their states/utilizations, a quick overview section of current/future/past experiments with easy access to summary stats + descriptions of what was run alongside hypotheses and rationale for these experiments, a running log of things happening (whether queueing experiments or coding work).

Definitely still quite bare bones but already been quite helpful in the way I approach this whole workflow and what I learn from it along the way rather than just being a passive observant without insight into what's actually being tested and run.

levidiamode's tweet photo. 143/365 of GPU Programming

Working on my GPU agent dashboard setup today.

A few features that have been really helpful so far are a live view of GPUs and their states/utilizations, a quick overview section of current/future/past experiments with easy access to summary stats + descriptions of what was run alongside hypotheses and rationale for these experiments, a running log of things happening (whether queueing experiments or coding work).

Definitely still quite bare bones but already been quite helpful in the way I approach this whole workflow and what I learn from it along the way rather than just being a passive observant without insight into what's actually being tested and run.

levidiamode's tweet photo. 143/365 of GPU Programming

Working on my GPU agent dashboard setup today.

A few features that have been really helpful so far are a live view of GPUs and their states/utilizations, a quick overview section of current/future/past experiments with easy access to summary stats + descriptions of what was run alongside hypotheses and rationale for these experiments, a running log of things happening (whether queueing experiments or coding work).

Definitely still quite bare bones but already been quite helpful in the way I approach this whole workflow and what I learn from it along the way rather than just being a passive observant without insight into what's actually being tested and run.

2

55

1

19

5K

2

55

1

33

4K

8 days ago

@NeelNanda5 - Link to Neel's vibe research stream: https://t.co/aixzce3awP - Link to Joseph's RL stream: https://t.co/5XDnDX83jZ

0

5

1

5

853

levidiamode retweeted

8 days ago

Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators. Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML). Blog: https://t.co/P31SGl0cyT Code: https://t.co/9nkO2hmeOl

47

2K

235

2K

862K

8 days ago

@Andy_ShuoYang super interesting, curious to see how this extends to blackwell gpus!

1

5

0

0

975

Last Seen Users on Sotwe

Trends for you

Most Popular Users