Garrett Goon @goongarrett - Twitter Profile

Garrett Goon @GoonGarrett

about 2 months ago

@FilipoGiovanni @willccbb Had the same reaction

0

1

0

29

Garrett Goon @GoonGarrett

about 2 months ago

@willccbb @eliebakouch Very good. Claude's writing was entirely fine, would be happy to see more

0

142

Garrett Goon @GoonGarrett

about 2 months ago

@StefanGliga @ezyang Why do you want it to die? What's your preferred alternative?

1

0

28

Garrett Goon @GoonGarrett

about 2 months ago

@PatrickToulme @ezyang Also my expectation

0

1

0

55

Garrett Goon @GoonGarrett

about 2 months ago

@WentaoGuo7 Awesome, congrats! Have you gotten to test if the fully_shard hangs still occur with the update? I could try tomorrow if not.

1

0

136

Garrett Goon @GoonGarrett

3 months ago

@wightmanr @jeremyphoward Great call out. I was also searching recently for what init schemes are being used in recent work and couldn't find much. Arcee's Trinity tech report was one find. Can you suggest others in addition to Olmo?

0

82

Garrett Goon @GoonGarrett

3 months ago

@m_sirovatka I hit a bunch of "CUDA unspecified launch errors" recently. New one for me. Very unwelcome

1

2

0

114

Garrett Goon @GoonGarrett

3 months ago

@stochasticchasm I guess there's a middle ground where you compute the loss itself in chunks but realize the full logits as usual. Some memory savings + logits/KL unaffected
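A rough sketch of that middle ground, assuming a plain cross-entropy loss: the full logits are already realized, and only the loss reduction walks over them in row chunks. The names `chunked_cross_entropy` and `chunk_size` are made up for illustration, and pure Python lists stand in for framework tensors, so this shows the arithmetic rather than the actual memory savings.

```python
import math

def log_sum_exp(row):
    # Numerically stable log(sum(exp(x))) over one row of logits.
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))

def cross_entropy_sum(logits, targets):
    # Sum of per-token negative log-likelihoods for a block of rows.
    return sum(log_sum_exp(row) - row[t] for row, t in zip(logits, targets))

def chunked_cross_entropy(logits, targets, chunk_size):
    # The full logits already exist; only the loss intermediates are
    # built one chunk of rows at a time, then averaged over all tokens.
    total = 0.0
    for start in range(0, len(logits), chunk_size):
        end = start + chunk_size
        total += cross_entropy_sum(logits[start:end], targets[start:end])
    return total / len(logits)

# Hypothetical toy data: 5 tokens, vocabulary of 3.
logits = [
    [2.0, 0.5, -1.0],
    [0.1, 0.2, 0.3],
    [-0.5, 1.5, 0.0],
    [1.0, 1.0, 1.0],
    [3.0, -2.0, 0.5],
]
targets = [0, 2, 1, 1, 0]

full = cross_entropy_sum(logits, targets) / len(logits)
chunked = chunked_cross_entropy(logits, targets, chunk_size=2)
```

Since the logits themselves are untouched, anything else that reads them (a KL term, for instance) sees the same values as in the unchunked case.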

0

1

0

11

Garrett Goon @GoonGarrett

3 months ago

@stochasticchasm Ah ok, got it. Surprised about the instabilities mentioned elsewhere. What ctx len do they start to be significant at?

1

0

38

Garrett Goon @GoonGarrett

3 months ago

@difficultyang So many headaches with CC rendering. Hitting this one often. Also seems exacerbated by tmux

0

17

Garrett Goon @GoonGarrett

3 months ago

@YouJiacheng What's the strategy to counteract this, then?

0

1

0

45

Garrett Goon @GoonGarrett

3 months ago

@m_sirovatka @samsja19 <2000 tokens? 🤔

0

16

Garrett Goon @GoonGarrett

3 months ago

@eliebakouch Wow, the Sonnet jump @256k

0

69

Garrett Goon @GoonGarrett

3 months ago

@StasBekman @aryanvs_ The speed up was due to throwing more GPUs at the workload, though, right? Baseline being a single GPU, other cases using N, and seeing a near linear decrease in overall runtime? Not a per-GPU improvement

1

0

41

Garrett Goon @GoonGarrett

3 months ago

@m_sirovatka @oneill_c @Jozef_Nathaniel Nightly or GTFO (saw the other threads before this one)

0

1

0

82

Garrett Goon @GoonGarrett

3 months ago

@m_sirovatka Yeah agreed, just thinking about how you could solve it if you really needed to

0

1

19

Garrett Goon @GoonGarrett

3 months ago

@m_sirovatka Add like a fully_shard_bwd API to wrap modules whose grads should be RS'd together. E.g. fully_shard on a whole transformer block and wrap the MLP and attn sub-blocks each with fully_shard_bwd so their grads are bucket-reduced and freed earlier than default. Maybe too complicated

1

0

44

Garrett Goon @GoonGarrett

3 months ago

@m_sirovatka Yeah was thinking the same. Basically enabling the backwards bucketing strategy to be different from the fwd, rather than forcing both to consolidate all collectives into a single launch

1

0

26

Garrett Goon @GoonGarrett

3 months ago

@stochasticchasm @m_sirovatka Same, yeah. Still annoying though.

0

1

0

14
