Valentin Thomas @_valthomas - Twitter Profile

about 1 month ago

Keller's approach (ultra-fast iteration) is promising because it led to the first major innovation since Adam (Muon). CIFAR took only 2 seconds to train end-to-end, which meant he could try many ideas fast. His first unoptimized Muon run took something like 30 seconds, but it was clear it was onto something because of the large drop in the number of steps.

_valthomas retweeted

Neil Zeghidour

@neilzegh

4 months ago

Me defending my O(n^3) solution to the coding interviewer.

_valthomas retweeted

Jonathan Gorard @getjonwithit

5 months ago

Like @davidbessis and others, I think that Hinton is wrong. To explain why, let me tell you a brief story. About a decade ago, in 2017, I developed an automated theorem-proving framework that was ultimately integrated into Mathematica (see: https://t.co/nGCIUk44TP) (1/15)

Valentin Thomas @_valthomas

7 months ago

@konstmish @clashluke https://t.co/Tua9D3orMI

Valentin Thomas @_valthomas

7 months ago

@konstmish @clashluke A while ago, for many small networks (CNN, MLP), we found that the gradient second moment, Fisher, and Hessian tended to align pretty early in training.

Valentin Thomas @_valthomas

9 months ago

@tensor_rotator @F_Vaggi @TacoCohen @dwarkesh_sp We had a fun paper on it a while ago (https://t.co/F4FysUVWG1), and it seems like the value function does a better job at exploring (being optimistic/having "high standards") compared to the variance-reducing baseline.

Valentin Thomas @_valthomas

9 months ago

@tensor_rotator @F_Vaggi @TacoCohen @dwarkesh_sp Actually, the value function can be a very poor baseline for reducing the variance of the gradient: in a simple 2-arm bandit with rewards 1 and 0 and (sigmoid) probabilities p and 1-p, the value function would be p while the variance-reducing baseline is 1-p! So those are anticorrelated!
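
The bandit claim above can be checked numerically. This is a minimal sketch, assuming a single scalar parameter theta with p = sigmoid(theta) and the usual score-function (REINFORCE) estimator (R - b) * d/dtheta log pi(a); the function names are made up for illustration and are not from the paper or the thread.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_variance(p, b):
    """Exact variance of (R - b) * d/dtheta log pi(a) for the 2-arm bandit.

    Arm 0 has reward 1 and probability p = sigmoid(theta);
    arm 1 has reward 0 and probability 1 - p.
    """
    rewards = np.array([1.0, 0.0])
    probs = np.array([p, 1.0 - p])
    # Score function d/dtheta log pi(a) under the sigmoid parametrization.
    score = np.array([1.0 - p, -p])
    g = (rewards - b) * score
    mean = np.sum(probs * g)
    return np.sum(probs * (g - mean) ** 2)

def best_baseline(p, grid=None):
    """Grid search for the constant baseline with the lowest variance."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 1001)
    variances = np.array([grad_variance(p, b) for b in grid])
    return grid[np.argmin(variances)]

for theta in (-2.0, 0.5, 1.5):
    p = sigmoid(theta)
    print(f"p={p:.3f}  value function={p:.3f}  "
          f"best baseline={best_baseline(p):.3f}  1-p={1 - p:.3f}")
```

With b = 1-p both arms yield the same sample gradient p(1-p), so the variance is exactly zero in this toy problem, whereas the value function b = p leaves a variance of p(1-p)(1-2p)^2.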

Valentin Thomas @_valthomas

10 months ago

@DimitrisPapail I see, it's because you don't use a baseline, so the update for non-valid tokens is 0, right? Do you think you generally get rid of the baseline?

_valthomas retweeted

Yunhao (Robin) Tang @robinphysics

12 months ago

Maybe to one's surprise, taking KL estimates as `kl_loss` to minimize does *not* enforce the KL. This implementation, however, is quite common in open source RL repos and recent research papers. In short: grad of an unbiased KL estimate is not an unbiased estimate of KL grad.
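
A small exact computation illustrates the point. This sketch assumes the simplest estimator, log pi(x) - log pi_ref(x) with x sampled from the current policy and treated as a constant when differentiating; the tweet does not say which estimator it has in mind, and the function names here are hypothetical. For a softmax policy over a few outcomes, both expectations can be summed exactly instead of sampled.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def kl(theta, ref_logits):
    """Exact KL(pi_theta || pi_ref) for categorical softmax policies."""
    pi, ref = softmax(theta), softmax(ref_logits)
    return np.sum(pi * (np.log(pi) - np.log(ref)))

def exact_gradients(theta, ref_logits):
    """Return (true KL gradient, expected gradient of the naive kl_loss)."""
    pi, ref = softmax(theta), softmax(ref_logits)
    log_ratio = np.log(pi) - np.log(ref)
    # Row x holds d/dtheta log pi(x) = onehot(x) - pi for a softmax policy.
    score = np.eye(len(theta)) - pi
    # True gradient: E_{x~pi}[ log_ratio(x) * score(x) ].
    true_grad = (pi * log_ratio) @ score
    # Naive loss: differentiate log pi(x) - log pi_ref(x) with x held fixed,
    # which gives score(x); its expectation under pi is E_{x~pi}[ score(x) ].
    naive_grad = pi @ score
    return true_grad, naive_grad

theta = np.array([1.0, 0.0, -1.0])   # current policy logits (assumed values)
ref_logits = np.zeros(3)             # uniform reference policy (assumed)
true_grad, naive_grad = exact_gradients(theta, ref_logits)
print("KL:", kl(theta, ref_logits))
print("true KL gradient:      ", true_grad)
print("expected naive gradient:", naive_grad)
```

Under these assumptions the naive loss has an expected gradient of zero, so on average it applies no pull toward the reference policy, even though the estimate itself is unbiased for the KL value.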

_valthomas retweeted

Vahid Balazadeh

@vahidbalazadeh

12 months ago

Can neural networks learn to map from observational datasets directly onto causal effects? YES! Introducing CausalPFN, a foundation model trained on simulated data that learns to do in-context heterogeneous causal effect estimation, based on prior-fitted networks (PFNs). Joint work with @Layer6AI & @hamid_R_kamkar w/ @_valthomas, Jeremy Ma, Benson Li, Jesse C. Cresswell, & @rahulgk 📝ArXiv: https://t.co/jc9plTMo44 🔗Code: https://t.co/MVO8j24mR8 🗣️Oral paper @ ICML SIM workshop 🧵[1/7]

Valentin Thomas @_valthomas

about 1 year ago

@leloykun Isn't that just a bias you can fold into the learning rate? I'm not sure it matters at all compared to having a non-constant bias of the return (by using a value function, for instance).

Valentin Thomas @_valthomas

over 1 year ago

@Nils_Reimers @jxmnop I think the question is about the ratio between the FF dim and the transformer dim, even for the same parameter count.

Valentin Thomas @_valthomas

over 1 year ago

@y0b1byte And I totally forgot, but it leads to an additional -pi1 grad log pi for the negative sample. So RL pushes down the log prob of negative samples but doesn't push up the log prob of positives as much. In contrast, SFT pushes up/copies positive examples.

Valentin Thomas @_valthomas

over 1 year ago

@y0b1byte So you also didn't add a baseline. If you do, its value is pi(tau1), leading to (1 - pi(tau1)) nabla log pi(tau1) for the gradient. So there's an additional saturation effect, which can also help with exploration.
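
The per-sample gradients in this thread can be written out as follows. This is a sketch under an assumption the tweets do not state explicitly: a single trajectory tau1 has reward 1 and every other trajectory has reward 0, with the baseline set to the expected return.

```latex
% Assumption (not stated in the tweets): R(\tau_1) = 1 and R(\tau) = 0 otherwise.
\begin{align*}
b &= \mathbb{E}_{\tau \sim \pi}[R(\tau)] = \pi(\tau_1), \\
\hat{g}(\tau_1) &= \bigl(1 - \pi(\tau_1)\bigr)\,\nabla_\theta \log \pi(\tau_1), \\
\hat{g}(\tau) &= -\pi(\tau_1)\,\nabla_\theta \log \pi(\tau)
  \quad \text{for } \tau \neq \tau_1 .
\end{align*}
```

The factor 1 - pi(tau1) shrinks the push on the positive trajectory as its probability approaches 1, which is the saturation effect mentioned above, while the second line is the -pi1 term applied to negative samples.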

Valentin Thomas @_valthomas

over 1 year ago

@FSchaipp That's a very interesting question. I had worked on second-order, Fisher, and some ADMM stuff a while ago. It was kind of an open secret among optimization researchers I knew that it didn't generalize as well. Would love to see it confirmed or debunked!

_valthomas retweeted

Fabian Schaipp @FSchaipp

over 1 year ago

Optimization hyperparameters (LR, schedule, weight decay) do not affect loss-to-loss scaling of LLMs (which could be seen as a proxy for generalization). ☄️ Unclear: how about different optimizers (Shampoo, ScheduleFree...)? Plots from this paper: https://t.co/h7F6yafLaA

_valthomas retweeted

Surya Ganguli

@SuryaGanguli

over 1 year ago

*Every single* cure for a disease ultimately flowed from basic exploratory research. Stopping basic research is like stopping the mountain rains and expecting rivers of cures to still flow. Examples:
1) studying saliva of Gila monster -> GLP1's
2) studying fungi -> first statins
3) mRNA biology -> gene therapy for spinal atrophy
4) studying bacterial genetics -> CRISPR gene therapies
5) studies of nuclear magnetic resonance -> MRI scans
This list can go on and on. Not only in biology but all aspects of technology, e.g.
6) curvature of spacetime -> GPS
7) quantum mechanics -> semiconductors
8) electromagnetism -> fiber optics -> internet
...

_valthomas retweeted