Jason Rader @packquickly - Twitter Profile

packquickly retweeted

Ben Walker

@benjaminwalker

over 1 year ago

Looking forward to presenting this work at #NeurIPS2024! Come find us on Thursday from 11-2 @ West Ballroom A-D #6907

Jason Rader @packquickly

over 1 year ago

I am genuinely interested! Empirical research around how well our ad-hoc estimates (of gradients, variances, and Fisher information for example) perform is surprisingly limited, since it needs to be constantly reevaluated as SOTA changes

Jason Rader @packquickly

over 1 year ago

Adam depends on the gradient distribution during training, which, as far as I know, we don't understand well? Here, adapted from the Adam paper, v_t is the var estimate, G_t is the gradient r.v. and X_t is an error r.v. for distribution shift. https://t.co/v3Pbv4bUHg
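
A minimal sketch of the quantity discussed above, assuming the standard Adam update for v_t: an exponential moving average of squared gradients with bias correction (strictly an uncentred second-moment estimate). The function name, the beta2 value, and the toy gradient sequences are illustrative and do not come from the thread or the linked image. With a stationary gradient the corrected estimate is exact; after an abrupt shift it lags behind the new value, which is the kind of error an X_t term would have to absorb.

```python
def adam_second_moment(grads, beta2=0.999):
    """Return the bias-corrected estimates v_hat_t for a sequence of scalar gradients."""
    v = 0.0
    estimates = []
    for t, g in enumerate(grads, start=1):
        # Exponential moving average of the squared gradient.
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction for the zero initialisation of v.
        estimates.append(v / (1.0 - beta2 ** t))
    return estimates


# Stationary case: every gradient is 2.0, so the true second moment is 4.0.
stationary = adam_second_moment([2.0] * 50)

# Shifted case: the gradient drops to 0.5 halfway, so the true value becomes 0.25.
shifted = adam_second_moment([2.0] * 50 + [0.5] * 50)

print(stationary[-1])
print(shifted[-1])
```

In the shifted run the final estimate sits well above 0.25 because the moving average still carries the pre-shift gradients.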

Jason Rader @packquickly

over 1 year ago

Should we be trying to detect distribution shift and correct it (eg. by taking more samples at the same set of parameters)? Does this matter in some models and not others?

packquickly retweeted

Shreyas Kapur @shreyaskapur

about 2 years ago

My first PhD paper!🎉We learn *diffusion* models for code generation that learn to directly *edit* syntax trees of programs. The result is a system that can incrementally write code, see the execution output, and debug it. 🧵1/n

Jason Rader @packquickly

about 2 years ago

@Apoorva__Lal Garner's usage dictionary calls these needless variants. Usage dictionaries outside of professional editing seem somewhat rare, but would be very useful in technical disciplines (if unrealistic to make).

Jason Rader @packquickly

about 2 years ago

@bronzeagepapi Probably not. I've generally found custom kernels to be less necessary when using JAX due to compilation (the benefit may be marginal compared to PyTorch.)

Jason Rader @packquickly

about 2 years ago

Screw it, here's a new JAX implementation in Optimistix: https://t.co/YC0MUzd1vf

Aaron Defazio

@aaron_defazio

about 2 years ago

Schedule-Free paper is up! https://t.co/vduzoGL5EP Joint work with collaborators @alicey_ang @HarshMeh1a @konstmish @akhaledv2 @AshokCutkosky We have some strong small-scale experiments on Transformers, comparing to chinchilla-style cosine 10x reduction schedules. https://t.co/l6aDEo2vX7

Jason Rader @packquickly

about 2 years ago

@aaron_defazio @alicey_ang @HarshMeh1a @konstmish @akhaledv2 @AshokCutkosky Cool stuff! Went ahead and implemented it in JAX as well using Optimistix: https://t.co/YC0MUzd1vf

Jason Rader @packquickly

about 2 years ago

@sp_monte_carlo Neat! Heuristics based on this crop up in stochastic optimisation. Often in the argument of stepsize * sum/sqrt(squared_sum) being "roughly" bounded by stepsize. Of course the independent and symmetric assumptions are both violated but that never stopped anyone...
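
A rough sketch of the ratio mentioned above, assuming scalar (per-coordinate) gradients. By Cauchy-Schwarz, |sum| / sqrt(squared_sum) can never exceed sqrt(n), and it reaches that value when every gradient is identical. For independent, symmetric draws it is usually far smaller, which is the informal basis for treating the update as "roughly" bounded by the stepsize. The function name, sample size, and seed are illustrative choices.

```python
import math
import random


def normalised_sum(grads):
    """Return sum(g) / sqrt(sum(g^2)), the factor that multiplies the stepsize."""
    total = sum(grads)
    squared_total = sum(g * g for g in grads)
    return total / math.sqrt(squared_total)


n = 10_000
rng = random.Random(0)

# Independent, symmetric gradients: the assumptions the heuristic relies on.
symmetric = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Perfectly aligned gradients: the assumptions fail as badly as possible.
aligned = [1.0] * n

print(abs(normalised_sum(symmetric)))  # small relative to sqrt(n) for draws like these
print(normalised_sum(aligned))         # equals sqrt(n), the Cauchy-Schwarz worst case
```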

packquickly retweeted

Keith

@keithdunn

about 2 years ago

Jasmin Paris @JasminKParis finished loop five of the #BM100 in 59:58:21.

Jason Rader @packquickly

over 2 years ago

@RndmForestRunnr @wser Ultimately I think the question is "who do we want getting into wser?" Optimising the lottery is not the hard part, it's agreeing on who we should be optimising for that is.

Jason Rader @packquickly

over 2 years ago

Finally, a million thanks to @PatrickKidger, who supervised this whole project. If you’re following me, chances are good you already follow him. If not, go give him a follow! (right after installing Lineax of course 😉) 4/4

Jason Rader @packquickly

over 2 years ago

⭐ Lineax is now on arXiv! ⭐ If you’re doing linear solves or linear least-squares in JAX, give it a shot today! Lineax is fast ⚡️, has new solvers (eg. QR, tridiagonal), and supports general linear operators. github: https://t.co/Qn8PEI8SYH arXiv: https://t.co/nnDSiKMWOJ 1/n

Jason Rader @packquickly

over 2 years ago

The paper describes how we achieved many of these things (such as differentiation through all our solvers) and outlines some of the design choices we made when creating Lineax. https://t.co/nnDSiKMWOJ

Jason Rader @packquickly

over 2 years ago

@sp_monte_carlo depending on your philosophy, many would simply call this "mathematics" :)

Jason Rader @packquickly

over 2 years ago

They mentioned the choice between these two model functions made little difference in practice. While this is believable, and indeed proved to be true, it struck me as an example of a claim which is very difficult to verify using existing optimisation software. 12/12

Jason Rader @packquickly

over 2 years ago

Tikhonov regularised trust-region methods (*cough* Levenberg-Marquardt) oddly use two different approximations to the objective function at each step. One regularised, one not. What if we just regularised both? 1/ https://t.co/VJpKRDM1r2
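
A scalar sketch of the two model functions, assuming the usual Levenberg-Marquardt setup: the step minimises the Tikhonov-regularised Gauss-Newton model, while the predicted reduction used to accept or reject that step conventionally comes from the unregularised model. The function names and numbers are illustrative and are not taken from the linked image.

```python
def lm_step(residual, jacobian, damping):
    """Minimiser of 0.5*(r + J*p)**2 + 0.5*damping*p**2 for scalar r and J."""
    return -jacobian * residual / (jacobian * jacobian + damping)


def predicted_reduction(residual, jacobian, step, damping=0.0):
    """Model decrease from p = 0 to p = step; damping=0.0 gives the unregularised model."""
    model_at_zero = 0.5 * residual ** 2
    model_at_step = (
        0.5 * (residual + jacobian * step) ** 2 + 0.5 * damping * step ** 2
    )
    return model_at_zero - model_at_step


r, J, lam = 2.0, 1.0, 1.0
p = lm_step(r, J, lam)

# Conventional choice: step from the regularised model, reduction from the plain one.
unregularised = predicted_reduction(r, J, p)

# The alternative raised above: use the regularised model for the reduction too.
regularised = predicted_reduction(r, J, p, damping=lam)

print(p, unregularised, regularised)
```

For the same step, the regularised model never predicts a larger decrease than the unregularised one, so an acceptance ratio built on it would rate a step that actually reduces the objective at least as favourably.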

Jason Rader @packquickly

over 2 years ago

This example was taken from an offhand comment in "Training Deep and Recurrent Networks with Hessian-Free Optimisation" by Martens and Sutskever.
