Vijay

@__tensorcore__

Systems and GPU Performance Mechanic - TBD Ex. CUTLASS 3.x / 4.x etc

Joined July 2015

618 Following

2.5K Followers

1.5K Posts

Pinned Tweet

Vijay @__tensorcore__

4 months ago

As of last week, I am no longer at NVIDIA 🧵 Leaving the CUTLASS team was extremely hard. I will dearly miss my incredible colleagues and the extremely compelling mission statement of creating the world's best accelerator programming model w/ hardware software codesign 💚

https://t.co/rF1CnQl4PF

16

370

18

77

27K

__tensorcore__ retweeted

14 days ago

After some mathematical rewrite, turns out all of transformer is a series of gemm + epilogue. Given a few optimized primitives, LLMs (and novice humans) can write speed-of-light kernels for all transformer ops!

18

1K

128

944

130K

__tensorcore__ retweeted

14 days ago

LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels. CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip. Bonus: LLMs can write fast CODA kernels too (approaching SoLs).

https://t.co/cOTeMUr4py

15

677

103

531

196K

__tensorcore__ retweeted

14 days ago

We built a kernel abstraction to rewrite the entire transformer stack as GEMM + Epilogue kernels! Neural net architectures such as transformers consist entirely of matrix multiplications and elementwise nonlinearities such as RMSNorm, log sum exp, and gated activations. Fusing these elementwise nonlinearities into GEMMs in both the forward and backward passes allows us to make training and prefill as compute-bound as possible! Our kernel abstraction CODA is implemented in CuTeDSL, and by abstracting away the fixed prologue and main loop of the GEMM kernel, we expose an epilogue function where LLMs like Claude can easily implement elementwise nonlinearities in fusions approaching speed-of-light!

1

180

24

101

19K
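The "GEMM + epilogue" idea in the tweet above can be illustrated with a minimal pure-Python sketch. This is not CODA's API and not CuTeDSL; the names `gemm_with_epilogue`, `gemm_then_activation`, and `silu`, and the choice of SiLU as the elementwise nonlinearity, are assumptions made only for illustration. The sketch shows the structure: a fixed tiled main loop accumulates one output tile, and a user-supplied epilogue function is applied to that tile before it is written out, so no separate elementwise pass over the full result is needed.

```python
import math


def silu(x):
    """Elementwise SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gemm_with_epilogue(a, b, epilogue, tile=2):
    """Compute epilogue(A @ B) tile by tile (illustrative, hypothetical helper).

    The epilogue runs on each accumulator tile before it is stored,
    standing in for a fusion that happens before results leave the chip.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # Main loop: accumulate one output tile in a local buffer.
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for p in range(k):
                for i in rows:
                    for j in cols:
                        acc[i, j] += a[i][p] * b[p][j]
            # Epilogue: apply the elementwise function while storing the tile.
            for (i, j), value in acc.items():
                c[i][j] = epilogue(value)
    return c


def gemm_then_activation(a, b, activation):
    """Unfused reference: full matmul first, then a second elementwise pass."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
         for i in range(m)]
    return [[activation(v) for v in row] for row in c]
```

Both functions return the same values; the difference is only where the elementwise work happens. On a GPU the fused form avoids writing the intermediate matrix to memory and reading it back, which is the memory-bound step the tweets describe.
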

__tensorcore__ retweeted

29 days ago

We’ve developed our own inference engine Runtime-Optimized Serving Engine (ROSE) to serve models ranging from embeddings to trillion-parameter LLMs. With CuTeDSL integrated into our inference engine, Perplexity can build the specialized GPU kernels faster to bring models up to peak performance on NVIDIA Hopper and Blackwell GPUs.

74

1K

121

352

160K

Vijay @__tensorcore__

about 1 month ago

@cHHillee @tenderizzation One could say you’re still a junior celebrity

0

3

0

0

78

Vijay @__tensorcore__

about 1 month ago

@cHHillee @tenderizzation Learning to be a celebrity I see!

2

5

0

0

470

__tensorcore__ retweeted

resham ☻ @Reshusaur

about 1 month ago

new walk of shame: agent still working, but the cafe closed

https://t.co/MVrJWI2aj2

263

5K

188

267

603K

Vijay @__tensorcore__

about 1 month ago

@PatrickToulme @AlpinDale This ain’t true. You have the nvvm dialect for native PTX authoring too without “escape hatches”

0

6

0

1

137

__tensorcore__ retweeted

@tenderizzation

7 months ago

[ENG SUB] how it feels to use eager pytorch in 2025

28

472

60

109

87K

__tensorcore__ retweeted

Alex Zhurkevich @cudagdb

about 1 month ago

Tomorrow: Blackwell Programming lecture by yours truly at Stanford CME213, Gates B3, 1:30–2:50 PM. Bring sharp questions.

6

135

9

52

8K

__tensorcore__ retweeted

Kimi.ai @Kimi_Moonshot

about 1 month ago

We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github: https://t.co/sf4UohXDWY

45

2K

184

616

213K

__tensorcore__ retweeted

about 2 months ago

Meta released Avocado, they call it Muse Spark. It's not open source (a bit sad). Meta TBD lab rebuilt the entire pretraining stack in 9 months and reached similar capability with >10x less compute than Llama 4 Maverick. I still think infra is the real moat in AI labs. You can train models much faster with a good infra, and it allows researchers to experiment with many more ideas much more quickly.

41

640

31

96

54K

__tensorcore__ retweeted

about 2 months ago

Excited to share what we’ve been building at Meta Superintelligence Labs! We just released Muse Spark, our first AI model. It's a natively multimodal reasoning model and the first step on our path to personal superintelligence. We've overhauled our entire stack to support scaling, and this is just the beginning. https://t.co/KNVjgMcch1

74

2K

172

233

235K

__tensorcore__ retweeted

about 2 months ago

1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵

https://t.co/fThDXdsxwB

736

10K

1K

3K

5M

__tensorcore__ retweeted

Ji-Ha @Ji_Ha_Kim

2 months ago

Very cool! I worked on this recently, and I actually used an identical approach early on. But I believe there is a significantly better approach - a **single** minimax rational iteration can beat 5 polynomial steps!

https://t.co/didxjwjErE

3

137

10

64

16K

__tensorcore__ retweeted

Alex Zhurkevich @cudagdb

2 months ago

Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powering some of today’s top served models. Dive in, learn, use them, or level up your own. Enjoy. https://t.co/2aQBwcdnZL

13

334

51

260

148K

__tensorcore__ retweeted

Edward Z. Yang @ezyang

2 months ago

In my opinion, here are the most important ideas of CuTe Layouts (https://t.co/7ZSFcDeiqt) 🧵

3

249

26

286

16K

__tensorcore__ retweeted

3 months ago

The frontier has increasingly shifted to hybrid models - from Qwen to Kimi-Linear and now with NVIDIA's Nemotron-3 Super - that rely on a strong linear sequence model. Today we release Mamba-3, the most powerful linear model to date. https://t.co/OpMmqEWMkP

11

838

111

330

78K

Vijay @__tensorcore__

3 months ago

https://t.co/rkvPtJMmbe

0

15

1

0

943

__tensorcore__ retweeted

3 months ago

Excited to share @Standard_Kernel's seed round and some reflections on what we’ve learned about kernel generation and what we believe is next. Grateful to our amazing team, supporters, and the broader community pushing this space forward.

https://t.co/MuHvIhWoeF

48

514

46

190

134K
