Chen Lu @_Chen_Lu_ - Twitter Profile

Chen Lu @_Chen_Lu_

3 months ago

Composer 2 is pretty good!

Cursor @cursor_ai

3 months ago

We're releasing a technical report describing how Composer 2 was trained.

Chen Lu @_Chen_Lu_

3 months ago

@kevinyang 🔥 congrats!

_Chen_Lu_ retweeted

Magic @magicailabs

almost 2 years ago

LTM-2-Mini is our first model with a 100 million token context window. That’s 10 million lines of code, or 750 novels. Full blog: https://t.co/oFz4A9ynVZ Evals, efficiency, and more ↓
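As a rough sanity check, the ratios implied by the figures in the tweet above can be worked out directly. This is only arithmetic on the quoted numbers, not something Magic published, and the variable names are illustrative.

```python
# Figures quoted in the tweet.
context_tokens = 100_000_000   # "100 million token context window"
lines_of_code = 10_000_000     # "10 million lines of code"
novels = 750                   # "750 novels"

# Implied averages (assumes the tweet's equivalences are exact).
tokens_per_line = context_tokens / lines_of_code
tokens_per_novel = context_tokens / novels

print(f"about {tokens_per_line:.0f} tokens per line of code")
print(f"about {tokens_per_novel:,.0f} tokens per novel")
```

So the tweet's comparison assumes roughly 10 tokens per line of code and roughly 133K tokens per novel.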

Chen Lu @_Chen_Lu_

almost 2 years ago

@yapdianang Hey Dian Ang! Don't think there were too many surprises because this was a learning project. I did learn that convolutions are hard, and that you really should worry about memory transfers.

Chen Lu @_Chen_Lu_

almost 2 years ago

I wrote a UNet diffusion model in pure CUDA: https://t.co/JQaLDywKtS This project was inspired by @karpathy's llm.c (https://t.co/aybQH3NAo8). I also learnt a lot about CUDA kernels from @Si_Boehm's Matmul blog (https://t.co/PKphlZRHz6). (1/3)

Chen Lu @_Chen_Lu_

almost 2 years ago

@MaheshaGodekere Read Simon's blog, it's terrific

Chen Lu @_Chen_Lu_

almost 2 years ago

@RisingSayak @karpathy @Si_Boehm images are 64x64, using fp32, not sure if that's what you're asking?

Chen Lu @_Chen_Lu_

almost 2 years ago

@iman2_718 @karpathy @Si_Boehm Thanks for the suggestion!! It looks promising for reducing the memory reloads for the convolutions, let me check it out

Chen Lu @_Chen_Lu_

almost 2 years ago

@naklecha @karpathy @Si_Boehm Thanks @naklecha! Excited to see what's next from aaaaaaaaaa!

Chen Lu @_Chen_Lu_

almost 2 years ago

@TiggerSharkML @karpathy @Si_Boehm Ah, that should be doable already with the kernels in Andrej's llm.c, since DiT uses the same components. You'd probably want to rewrite some kernels to get the best performance; otherwise you will be doing a bunch of slow tensor permutes. Definitely would be cool to see, though.

Chen Lu @_Chen_Lu_

almost 2 years ago

@ChrisChoy208 @karpathy @Si_Boehm Ah, thanks for spotting the cudaMalloc in the training loop, Chris! I thought I had moved all of them outside the loop 😅 Almost all of the memory is otherwise allocated before the training loop, so hopefully this is not affecting the times too much.

Chen Lu @_Chen_Lu_

almost 2 years ago

@eliebakouch @karpathy @Si_Boehm Thanks elie!

Chen Lu @_Chen_Lu_

almost 2 years ago

@karpathy Thanks Andrej! Big fan of your work 😊

Chen Lu @_Chen_Lu_

almost 2 years ago

The main targets for optimization are the forward and backward passes of convolutions, which are currently written in a matmul-like fashion. (3/3)

Photo: https://t.co/a7M5pZOlPS
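For readers unfamiliar with the phrase, one common way to write a convolution "in a matmul-like fashion" is im2col: unroll every input patch into a row, then take dot products with the flattened kernel. The sketch below is not taken from the repo and may differ from how its kernels work. It assumes a single channel, stride 1, and no padding, and the function names are hypothetical. Like most deep learning code, it computes cross-correlation (the kernel is not flipped).

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2D image into one row."""
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(h - kh + 1)
        for j in range(w - kw + 1)
    ]


def conv2d_as_matmul(image, kernel):
    """Convolution expressed as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)
    # Matrix-vector product: one dot product per output pixel.
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    # Reshape the flat result back into rows of the output image.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]


def conv2d_direct(image, kernel):
    """Reference implementation with explicit sliding-window loops."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
result = conv2d_as_matmul(image, kernel)
print(result)  # [[-4, -4], [-4, -4]]
```

Because overlapping patches are materialized separately, each input value can be copied up to kh * kw times, which is one reason memory traffic matters so much for this formulation.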

Chen Lu @_Chen_Lu_

almost 2 years ago

Most of the effort so far was spent getting the whole model to work. There is still a lot of room for optimization. Current per-iteration timings on a single RTX 4090:
- this repo: 143ms
- PyTorch: 66ms
- PyTorch with torch.compile: 59ms
(2/3)
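To put the gap in perspective, the relative slowdown can be computed from the three timings quoted above. This is a small sketch that only does arithmetic on those numbers; the dictionary and variable names are illustrative.

```python
# Per-iteration timings quoted in the tweet, in milliseconds.
timings_ms = {
    "this repo": 143,
    "PyTorch": 66,
    "PyTorch with torch.compile": 59,
}

repo_ms = timings_ms["this repo"]

# How many times slower the CUDA repo is than each PyTorch baseline.
slowdown = {
    name: round(repo_ms / ms, 2)
    for name, ms in timings_ms.items()
    if name != "this repo"
}

for name, factor in slowdown.items():
    print(f"this repo is {factor}x slower than {name}")
```

At the time of the tweet, the repo was therefore a bit over 2x slower than either PyTorch baseline.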
