John Rose @jrose2000 - Twitter Profile

@jrose2000

about 9 hours ago

@MainzOnX Fix H, D per model. Bucket S to powers of 2. Let XLA pad each bucket. Cache compiled HLOs keyed by bucket. Compile on cold start only. Amortized over thousands of requests, the recompile cost vanishes.

1

0

24
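
The bucketing recipe in the tweet above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the author's code: `bucket_length`, `BucketedCompileCache`, `compile_fn`, and `min_bucket` are hypothetical names, `compile_fn` stands in for whatever actually produces a compiled executable for a fixed sequence length, and padding is done on the host here for simplicity rather than left to XLA.

```python
from typing import Callable, Dict, List, Sequence


def bucket_length(seq_len: int, min_bucket: int = 16) -> int:
    """Round seq_len up to the next power of two, with a floor of min_bucket."""
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    return max(min_bucket, 1 << (seq_len - 1).bit_length())


class BucketedCompileCache:
    """Caches one compiled callable per sequence-length bucket.

    compile_fn is a hypothetical stand-in: given a bucket size, it returns a
    callable specialized to inputs of exactly that length.
    """

    def __init__(self, compile_fn: Callable[[int], Callable[[List[int]], object]]):
        self._compile_fn = compile_fn
        self._cache: Dict[int, Callable[[List[int]], object]] = {}
        self.compile_count = 0  # how many times we paid the compile cost

    def run(self, tokens: Sequence[int], pad_id: int = 0) -> object:
        bucket = bucket_length(len(tokens))
        if bucket not in self._cache:
            # Cold path: compile once for this bucket, then reuse.
            self._cache[bucket] = self._compile_fn(bucket)
            self.compile_count += 1
        # Pad on the right so every input in a bucket has the same shape.
        padded = list(tokens) + [pad_id] * (bucket - len(tokens))
        return self._cache[bucket](padded)


def fake_compile(bucket: int) -> Callable[[List[int]], int]:
    """Toy 'compiler': the returned callable just reports its input length."""
    def executable(padded: List[int]) -> int:
        assert len(padded) == bucket
        return len(padded)
    return executable
```

With this layout, requests of length 5, 9, and 16 all share the 16 bucket and requests of length 17 and 30 share the 32 bucket, so five requests trigger only two compiles. In a real serving path the padded positions would also need masking so they do not affect the output.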

John Rose

@jrose2000

about 13 hours ago

@MoritzW42 Mythos pls fix think hard no mistakes surprise me

0

201

John Rose

@jrose2000

about 13 hours ago

@roydanroy Don’t tempt me with a good time

0

186

John Rose

@jrose2000

3 days ago

From shakespeare-char up to enwik8 for the depth testing

0

29

John Rose

@jrose2000

3 days ago

Layer-wise training usually collapses with depth. DiffusionBlocks x NanoGPT breaks the mold. Gap to baseline flat from L=6 to L=12, VRAM ~half

Photo: https://t.co/pU5J6EUgkP

1

0

59

John Rose

@jrose2000

4 days ago

@ritv3999 Ah thanks, though I wanna pressure test further. How does the σ-shuffle hold up to more depth!?

0

16

John Rose

@jrose2000

4 days ago

@mattmireles Sorry for your loss

0

1K

John Rose

@jrose2000

4 days ago

@ritv3999 Nice! I landed in the 0.45 hood as well. Definitely been interpreting "works at all" as exciting. Curious what your arch choice was for where the noised target enters.

1

0

36

John Rose

@jrose2000

5 days ago

@ritv3999 Exciting: possible to train a causal AR language model with independent blocks and no gradient flow between them

1

0

41

John Rose

@jrose2000

5 days ago

@ritv3999 Early numbers: 6-layer causal GPT, 3 independent blocks. ~2.16× lower peak VRAM, val CE within ~0.6 nats of baseline. Costs more training steps to converge

2

1

0

61
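
To put the early numbers above on a more familiar scale, here is a small arithmetic sketch. The helper names are hypothetical, and it assumes the reported cross-entropy is a natural-log average per prediction unit (token or character), which the tweet does not state.

```python
import math


def vram_fraction(reduction_factor: float) -> float:
    """Peak VRAM as a fraction of baseline, given an 'N times lower' factor."""
    return 1.0 / reduction_factor


def perplexity_ratio(ce_gap_nats: float) -> float:
    """Multiplicative perplexity increase implied by a cross-entropy gap in nats."""
    return math.exp(ce_gap_nats)


def nats_to_bits(nats: float) -> float:
    """Convert nats to bits (divide by ln 2)."""
    return nats / math.log(2)


print(f"peak VRAM vs baseline: {vram_fraction(2.16):.2f}")
print(f"perplexity ratio at 0.6 nats: {perplexity_ratio(0.6):.2f}")
print(f"0.6 nats in bits: {nats_to_bits(0.6):.2f}")
```

Under those assumptions, ~2.16× lower peak VRAM means roughly 46% of the baseline footprint, and a gap of ~0.6 nats corresponds to about 1.8× higher perplexity, or roughly 0.87 bits per prediction.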