Dylan Lim @dylan__lim - Twitter Profile

Pinned Tweet

about 1 year ago

One kernel. One llama. Mega results. Proud to share our fully fused Llama-1B megakernel! Check out the code and blog below!

Benjamin F Spector

@bfspector

about 1 year ago

(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint with @jordanjuravsky, @stuart_sul, @OwenDugan, @dylan__lim, @realDanFu, @simran_s_arora, and @HazyResearch)

dylan__lim retweeted

Stuart Sul

@stuart_sul

4 months ago

(1/7) We're releasing ThunderKittens 2.0! Faster kernels, cleaner code, industry contributions, and new state-of-the-art BF16 / MXFP8 / NVFP4 GEMMs that match or surpass cuBLAS! Alongside this release, we’re equally excited to share some insights we learned while squeezing every last TFLOP out of Blackwell: (with @hazyresearch & generously supported by @cursor_ai)

dylan__lim retweeted

Flapping Airplanes

@flappyairplanes

5 months ago

Announcing Flapping Airplanes! We’ve raised $180M from GV, Sequoia, and Index to assemble a new guard in AI: one that imagines a world where models can think at human level without ingesting half the internet.

dylan__lim retweeted

Stuart Sul

@stuart_sul

7 months ago

(1/6) GPU networking is the remaining AI efficiency bottleneck, and the underlying hardware is changing fast! We’re happy to release ParallelKittens, an update to ThunderKittens that lets you easily write fast computation-communication overlapped multi-GPU kernels, along with new kernels for data, tensor, sequence, and expert parallelism! Here’s a photo of overlapped kittens, along with things you should care about when optimizing multi-GPU kernels. (With @simran_s_arora, @bfspector, and @hazyresearch. Generously supported by @cursor_ai and @togethercompute)

Dylan Lim @dylan__lim

9 months ago

Megakernels continue with our 8-GPU Llama-70B release! Please check out below!

Benjamin F Spector

@bfspector

9 months ago

(1/8) We’re releasing an 8-GPU Llama-70B inference engine megakernel! Our megakernel supports arbitrary batch sizes, mixed prefill+decode, a paged KV cache, instruction pipelining, dynamic scheduling, interleaved communication, and more! On ShareGPT it’s 22% faster than SGLang.

https://t.co/nRUfEiCubk

Dylan Lim @dylan__lim

9 months ago

Happy to announce that multi-GPU ThunderKittens is finally here! Help your GPUs meow better by checking out the following blog!

Stuart Sul

@stuart_sul

9 months ago

(1/6) We’re happy to share that ThunderKittens now supports writing multi-GPU kernels, with the same programming model and full compatibility with PyTorch + torchrun. We’re also releasing collective ops and fused multi-GPU GEMM kernels, up to 2.6x faster than PyTorch + NCCL. (Joint with @dylan__lim, @bfspector, and @HazyResearch. Generously supported by @cursor_ai)

dylan__lim retweeted

Jordan Juravsky

@jordanjuravsky

about 1 year ago

Happy Throughput Thursday! We’re excited to release Tokasaurus: an LLM inference engine designed from the ground up for high-throughput workloads with large and small models. (Joint work with @achakravarthy01, @ryansehrlich, @EyubogluSabri, @brad19brown, @jshetaye, @HazyResearch, and @Azaliamirh)

dylan__lim retweeted

Andrej Karpathy

@karpathy

about 1 year ago

So so so cool. Llama 1B batch one inference in one single CUDA kernel, deleting synchronization boundaries imposed by breaking the computation into a series of kernels called in sequence. The *optimal* orchestration of compute and memory is only achievable in this way.

Dylan Lim @dylan__lim

about 1 year ago

Excited to share LayoutVLM—leveraging VLMs for spatial reasoning in 3D layout generation!

Fan-Yun Sun

@sunfanyun

about 1 year ago

Spatial reasoning is a major challenge for today's foundation models, even in simple tasks like arranging objects in 3D space. #CVPR2025 Introducing LayoutVLM, a differentiable optimization framework that uses VLMs to spatially reason about diverse scene layouts from unlabeled assets and open-ended language instructions. (1/n)

Dylan Lim @dylan__lim

about 2 years ago

Had a super fun time building this out - always love working on distributed ML systems. Big thanks to @pearvc for awarding us the best startup prize at Stanford TreeHacks!

Aksh Garg

@AkshGarg03

about 2 years ago

(1/5) @CKT_Conner, @dill_pkl, @emilyzsh, and I are excited to introduce Shard - a proof-of-concept for an infinitely scalable distributed system composed of consumer hardware for training and running ML models!

Features:
- Data + Pipeline Parallel for handling arbitrarily large models
- Algorithmic load balancing for throughput optimization
- Fault tolerance for unreliable machines

Dylan Lim @dylan__lim

about 2 years ago

@AkshGarg03 AI Financial Advisory Service: 1) Advisor Devin personalizes investment strategies. 2) Risk Manager Devin assesses and mitigates financial risks. 3) Market Analyst Devin forecasts market trends using AI.
