Adam Santoro

@santoroAI

Research Scientist in artificial intelligence at DeepMind

Montréal, Québec

Joined May 2016

219 Following

9.1K Followers

1.2K Posts

Pinned Tweet

Adam Santoro @santoroAI

about 2 years ago

Transformers can be made sparse across their depth. When trained isoFLOP, we can match or exceed the performance of vanilla models, while saving inference FLOPs https://t.co/jWl1wuHEko

santoroAI retweeted

@finbarrtimbers

about 2 years ago

reading the "mixture of depths" paper, which comes up with a novel way to conditionally apply compute depth-wise in a decoder

basically they use standard MoE-style expert-choice routing but they use it to choose which tokens get to go through every block in the decoder https://t.co/xQlr8EdGUN
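A minimal sketch of the routing idea described in this tweet, in plain Python. It assumes one scalar router score per token, a fixed per-block capacity, and a toy `block` function; the name `mod_block` and all values are hypothetical and not taken from the paper's code.

```python
from typing import Callable, List


def mod_block(
    tokens: List[float],
    router_scores: List[float],
    capacity: int,
    block: Callable[[float], float],
) -> List[float]:
    """Apply `block` only to the `capacity` highest-scoring tokens.

    Tokens that are not selected pass through unchanged (residual path).
    Illustrative only: real models route vectors, not scalars.
    """
    # Expert-choice style: the block picks its own top-k tokens by score.
    ranked = sorted(
        range(len(tokens)), key=lambda i: router_scores[i], reverse=True
    )
    selected = set(ranked[:capacity])

    out = []
    for i, x in enumerate(tokens):
        if i in selected:
            out.append(x + block(x))  # token goes through the block
        else:
            out.append(x)  # token routes around the block
    return out


tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.2, 0.8]
result = mod_block(tokens, scores, capacity=2, block=lambda x: 10.0)
```

Here only the tokens at positions 1 and 3 are updated; the other two keep their input values.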

santoroAI retweeted

about 2 years ago

Gemini and I also got a chance to watch the @OpenAI live announcement of gpt4o, using Project Astra! Congrats to the OpenAI team, super impressive work!

santoroAI retweeted

Google DeepMind @GoogleDeepMind

about 2 years ago

We watched #GoogleIO with Project Astra. 👀

Adam Santoro @santoroAI

about 2 years ago

@ivanleomk The FLOPs in the feedforward are not the same (MoD uses fewer), but you need to make the total training FLOPs (FLOPs-per-ffw * training steps) the same to see the effect. So, MoD trains for more steps
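The arithmetic behind this reply can be sketched as follows. The budget and per-step costs are hypothetical round numbers chosen for readability, and `isoflop_steps` is an illustrative helper, not something from the paper.

```python
def isoflop_steps(flop_budget: int, flops_per_step: int) -> int:
    """Number of training steps that fit into a fixed total-FLOP budget."""
    return flop_budget // flops_per_step


# Hypothetical numbers: the MoD model costs half as much per step.
budget = 10**18
vanilla_flops_per_step = 10**12
mod_flops_per_step = 5 * 10**11

vanilla_steps = isoflop_steps(budget, vanilla_flops_per_step)
mod_steps = isoflop_steps(budget, mod_flops_per_step)
```

With these assumed costs the MoD run gets twice as many steps for the same total training FLOPs.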

Adam Santoro @santoroAI

about 2 years ago

@ivanleomk The top-k isn't causal because whether a token is part of the top-k depends on the router weights of tokens that are after it in the sequence. During sampling you don't have these router weights since you need to produce tokens in a causal sequence
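A small illustration of why sequence-level top-k is not causal. The router weights below are invented for the example, and `top_k_indices` is a hypothetical helper.

```python
from typing import List, Set


def top_k_indices(router_weights: List[float], k: int) -> Set[int]:
    """Indices of the k largest router weights in the sequence."""
    ranked = sorted(
        range(len(router_weights)),
        key=lambda i: router_weights[i],
        reverse=True,
    )
    return set(ranked[:k])


# Made-up router weights for a 4-token prefix and its 6-token extension.
prefix = [0.2, 0.7, 0.5, 0.1]
full = prefix + [0.9, 0.8]

# Token 1 is in the top-2 when only the prefix is visible...
selected_given_prefix = 1 in top_k_indices(prefix, k=2)
# ...but later tokens with larger weights push it out.
selected_given_full = 1 in top_k_indices(full, k=2)
```

The same token's routing decision flips once later tokens are known, which is the information a sampler does not have when it generates token by token.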

Adam Santoro @santoroAI

about 2 years ago

@ivanleomk Training is not faster (it takes the same amount of FLOPs, and ~wall clock). Rather, the resultant model is ~50% faster to step during sampling (post-training) because it requires ~50% of the FLOPs in the feedforward

santoroAI retweeted

@shxf0072

about 2 years ago

Mixture of depth works for 300M, seq 512: it's faster and achieves better loss
Code: https://t.co/NrFxA0N4zV
writeup: https://t.co/DLJHROJdVj https://t.co/A9XiXQ6K1L

Adam Santoro @santoroAI

about 2 years ago

@KujoJot32604166 It's the latter, preserving the original positions in the sequence

Adam Santoro @santoroAI

about 2 years ago

@iamgrigorev And increasing batch, or model size, or depth, etc, each has implications on how you tune the optimizer

Adam Santoro @santoroAI

about 2 years ago

@iamgrigorev Apologies for not being explicit: when I say match training FLOPs, I mean *exactly* matching. So you need to calculate the FLOPs per ffw of each model and tune the training steps accordingly

Adam Santoro @santoroAI

about 2 years ago

@iamgrigorev Thanks for the update! FYI if you don't make up for the lost FLOPs in some way (e.g. train isoFLOP) then performance will be worse. As you can see in the paper, wall clock/FLOPs are the same during training, not total tokens. The wins then come with inference speed

Adam Santoro @santoroAI

about 2 years ago

@iamgrigorev @felix_red_panda @haeggee I agree, figuring out the best routing pattern per layer is an interesting thing to explore. No doubt there's something better than choosing some constant throughout the depth

santoroAI retweeted

George Grigorev @iamgrigorev

about 2 years ago

I have implemented Mixture-of-Depths and it shows significant memory reduction during training and 10% speed increase. I will verify if it achieves the same quality with 12.5% active tokens.
https://t.co/wbOByZZz4o
thanks @haeggee for initial code https://t.co/5BdLy4ZuEE

Adam Santoro @santoroAI

about 2 years ago

@felix_red_panda @iamgrigorev @haeggee e.g., putting 256 tokens through an MLP compared to 2048, per sequence. Batch size doesn't matter

Adam Santoro @santoroAI

about 2 years ago

@felix_red_panda @iamgrigorev @haeggee All the layers will always be active, the speed increases come from having to process a fraction of the sequence instead of the full thing. That fraction is constant as you change batch size
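A quick check of the batch-size point, using the 256-of-2048 tokens figure mentioned elsewhere in this thread. The helper `mlp_tokens_processed` and the batch sizes are illustrative assumptions.

```python
from typing import List, Tuple


def mlp_tokens_processed(
    batch_size: int, seq_len: int, capacity: int
) -> Tuple[int, int]:
    """Tokens entering one MLP for a whole batch: (routed, dense)."""
    dense = batch_size * seq_len
    routed = batch_size * capacity
    return routed, dense


SEQ_LEN = 2048  # full sequence length, per the thread
CAPACITY = 256  # tokens routed through the MLP, per sequence

fractions: List[float] = []
for batch_size in (1, 8, 64):  # arbitrary batch sizes
    routed, dense = mlp_tokens_processed(batch_size, SEQ_LEN, CAPACITY)
    fractions.append(routed / dense)
```

Batch size cancels out of the ratio, so the fraction of tokens processed is the same 12.5% in every case.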

Adam Santoro @santoroAI

about 2 years ago

@iamgrigorev @haeggee Awesome! Cool to see memory reductions too, we knew they should be there but didn't measure them

Adam Santoro @santoroAI

about 2 years ago

@EsotericCofe and if you haven't yet, don't forget to set the training FLOP budgets appropriately (rather than training step budgets) so that the LR schedules are correct and you don't undertrain the model
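One way to read this advice is to derive each model's schedule length from a shared FLOP budget. The sketch below assumes a plain cosine decay and invented budget and cost numbers; the actual schedule used in the paper is not stated in this thread.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical shared budget and per-step costs.
flop_budget = 10**15
flops_per_step = {"vanilla": 10**12, "mod": 5 * 10**11}

# Schedule length comes from the FLOP budget, not from a shared step count.
total_steps = {
    name: flop_budget // cost for name, cost in flops_per_step.items()
}

# At step 1000 the vanilla schedule has fully decayed,
# while the MoD schedule is only halfway through.
lr_vanilla = cosine_lr(1000, total_steps["vanilla"], peak_lr=1e-3)
lr_mod = cosine_lr(1000, total_steps["mod"], peak_lr=1e-3)
```

If both models shared the vanilla step budget instead, the MoD run would finish its decay after spending only half of the FLOPs.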

Adam Santoro @santoroAI

about 2 years ago

@EsotericCofe Nice work! If you plot by flops instead of steps you'll get a better perspective on whether the implementation is working well (ideally the MoD transformer will have a better loss than vanilla throughout training, plotted by flops)
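Re-indexing a loss log by cumulative FLOPs might look like the sketch below. The loss values and per-step costs are made up to show the effect and are not results from any run; `to_flop_axis` is a hypothetical helper.

```python
from typing import List, Tuple


def to_flop_axis(
    losses: List[float], flops_per_step: int
) -> List[Tuple[int, float]]:
    """Pair each logged loss with the cumulative FLOPs spent to reach it."""
    return [
        ((step + 1) * flops_per_step, loss)
        for step, loss in enumerate(losses)
    ]


# Made-up logs, one loss per step. The MoD step costs half as much here.
vanilla_curve = dict(to_flop_axis([3.0, 2.5], flops_per_step=2))
mod_curve = dict(to_flop_axis([3.2, 2.8, 2.5, 2.3], flops_per_step=1))

# Compare the two runs at matched FLOPs rather than matched steps.
matched = {f: (vanilla_curve[f], mod_curve[f]) for f in vanilla_curve}
```

In this toy log the MoD run looks worse at step 2 (2.8 against 2.5) but better at every matched FLOP count, which is the comparison the reply recommends.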

Adam Santoro @santoroAI

about 2 years ago

@haeggee @MatPagliardini @akmohtashami_a @Olivia61368522 A sigmoid sounds like a good idea :)
