Devina Mohan @DevinaMohan_ - Twitter Profile

6 months ago

…and find that OoD samples have higher ID estimates https://t.co/FXhs3a6ouD (work led by @QuerFont , who will also be at NeurIPS! Email him if you want to have a chat.)


Devina Mohan @DevinaMohan_

6 months ago

I will be at @NeurIPSConf in San Diego from Dec 2-7, please reach out if you want to chat about AI for science, uncertainties, probabilistic ML, astronomy etc! I will be presenting two papers at the #ML4PS workshop on Dec 6 -


Devina Mohan @DevinaMohan_

6 months ago

2. We examine diffusion-model-based intrinsic dimension estimates of the Radio Galaxy Zoo dataset (large unlabelled dataset) as a function of energy scores from an HMC-based Bayesian neural network trained on a small labelled dataset (MiraBest)…


DevinaMohan_ retweeted

Andrej Karpathy

@karpathy

over 1 year ago

The (true) story of development and inspiration behind the "attention" operator, the one in "Attention is All you Need" that introduced the Transformer. From personal email correspondence with the author @DBahdanau ~2 years ago, published here and now (with permission) following some fake news about how it was developed that circulated here over the last few days.

Attention is a brilliant (data-dependent) weighted average operation. It is a form of global pooling, a reduction, communication. It is a way to aggregate relevant information from multiple nodes (tokens, image patches, etc.). It is expressive, powerful, has plenty of parallelism, and is efficiently optimizable. Even the Multilayer Perceptron (MLP) can actually be almost re-written as Attention over data-independent weights (1st layer weights are the queries, 2nd layer weights are the values, the keys are just input, and softmax becomes elementwise, deleting the normalization). TLDR Attention is awesome and a *major* unlock in neural network architecture design.

It's always been a little surprising to me that the paper "Attention is All You Need" gets ~100X more err ... attention... than the paper that actually introduced Attention ~3 years earlier, by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: "Neural Machine Translation by Jointly Learning to Align and Translate". As the name suggests, the core contribution of the Attention is All You Need paper that introduced the Transformer neural net is deleting everything *except* Attention, and basically just stacking it in a ResNet with MLPs (which can also be seen as ~attention per the above). But I do think the Transformer paper stands on its own because it adds many additional amazing ideas bundled up all together at once - positional encodings, scaled attention, multi-headed attention, the isotropic simple design, etc. And the Transformer has imo stuck around basically in its 2017 form to this day ~7 years later, with relatively few and minor modifications, maybe with the exception of better positional encoding schemes (RoPE and friends).

Anyway, pasting the full email below, which also hints at why this operation is called "attention" in the first place - it comes from attending to words of a source sentence while emitting the words of the translation in a sequential manner, and was introduced as a term late in the process by Yoshua Bengio in place of RNNSearch (thank god? :D). It's also interesting that the design was inspired by a human cognitive process/strategy, of attending back and forth over some data sequentially. Lastly the story is quite interesting from the perspective of the nature of progress, with similar ideas and formulations "in the air", with particular mention of the work of Alex Graves (NMT) and Jason Weston (Memory Networks) around that time.

Thank you for the story @DBahdanau !
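The weighted-average description in the tweet above can be sketched in a few lines. This is a minimal illustration in plain Python, not code from the tweet or from either paper: the function names (`attention`, `mlp_as_attention`) and the toy vectors are assumptions made for the example, and the second function is only one reading of the MLP analogy, with ReLU swapped in as the elementwise, unnormalized stand-in for softmax.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Single-query scaled dot-product attention.

    The output is a data-dependent weighted average of the value vectors:
    the weights come from comparing the query with every key.
    """
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


def mlp_as_attention(x, w1_rows, w2_rows):
    """Two-layer MLP (no biases) written in attention form.

    Hypothetical reading of the analogy: each hidden unit i has a fixed
    "query" w1_rows[i] and a fixed "value" w2_rows[i]; the input x plays
    the role of the key. ReLU replaces softmax, so the weights are
    computed elementwise and are not normalized.
    """
    scores = [dot(q, x) for q in w1_rows]
    weights = [max(0.0, s) for s in scores]
    dim = len(w2_rows[0])
    return [sum(w * v[j] for w, v in zip(weights, w2_rows)) for j in range(dim)]


# Toy data (assumed for illustration): three tokens with 2-d keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention([1.0, 0.0], keys, values)

# Toy MLP weights (assumed): 3 hidden units, 2-d input and output.
w1_rows = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w2_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mlp_out = mlp_as_attention([2.0, 3.0], w1_rows, w2_rows)
```

In `attention` the weights change with the query and keys, which is the "data-dependent" part. In `mlp_as_attention` the "queries" and "values" are fixed parameters and only the key (the input) varies, which is the sense in which the tweet calls the weights data-independent.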




DevinaMohan_ retweeted

Shiqiang Wang @shiqiang_w

almost 2 years ago

Yes, we need more of this #ICML2024


DevinaMohan_ retweeted

UKP Lab @UKPLab

almost 2 years ago

🤔 Variational learning is often thought to be impractical 🔥 Plot twist: it actually works better than Adam! Meet IVON, a new optimizer that brings the best out of variational learning – 🧵 (1/9) #NLProc #ICML2024 📰 https://t.co/GLCqCezNJJ https://t.co/tJl4iWmg5v


Devina Mohan @DevinaMohan_

almost 2 years ago

I will be at my poster (#750) from 4.30 today @UncertaintyInAI! #UAI2024 https://t.co/1M2TmN6h59

Devina Mohan @DevinaMohan_

about 2 years ago

Excited to share our new paper on benchmarking Bayesian deep learning for radio galaxy classification! I’ll be at #UAI2024 and the #AABI2024 Workshop colocated with @icmlconf in July to present this work


DevinaMohan_ retweeted

Charles Margossian @charlesm993

almost 2 years ago

Some thoughts/ideas, "pêle-mêle" as we say in French, about VI and MCMC after attending #ProbAI. https://t.co/SAUGc8iFRg


Devina Mohan @DevinaMohan_

about 2 years ago

Our VI models also experience a cold posterior effect which cannot be explained based on the existing theories in the ML literature. We have previously examined model misspecification in VI with PAC-Bayes bounds, tried different priors, looked at data augmentation... to no avail


Devina Mohan @DevinaMohan_

about 2 years ago

Excited to share our new paper on benchmarking Bayesian deep learning for radio galaxy classification! I’ll be at #UAI2024 and the #AABI2024 Workshop colocated with @icmlconf in July to present this work

Anna Scaife @radastrat

about 2 years ago

Our #UAI2024 paper, led by @DevinaMohan_ , is on arXiv! How different Bayesian approximations in deep-learning perform for galaxy classification cf. an HMC baseline. [model performance, uncertainty calibration, dataset shift detection] https://t.co/AJQE5TtlLs @UncertaintyInAI



Devina Mohan @DevinaMohan_

about 2 years ago

Another interesting effect we observed was the trade-off between predictive performance and uncertainty calibration when using data augmentation with HMC - but augmentation does not seem to affect our VI models (which have been linked to the cold posterior effect in recent work)



DevinaMohan_ retweeted

nature

@Nature

about 2 years ago

Three scientist mothers call for a change in how conference childcare costs are reimbursed, drawing on their personal experiences https://t.co/QQBhFVDGSi


DevinaMohan_ retweeted

Dhruv Batra

@DhruvBatra_

about 2 years ago

.@eccvconf reviews are out and the official notification email points to a blog @deviparikh, @stefmlee, and I wrote a few years ago for our students. Glad to see that our lab style is spreading to the community :-). https://t.co/XNSnsomPiy


Devina Mohan @DevinaMohan_

about 2 years ago

@ayzwah Almost expected the two robot arms to high five after the insertion


DevinaMohan_ retweeted

Anna Scaife @radastrat

over 2 years ago · Manchester

I’m offering a deep-learning PhD project with my colleagues @sun_mingfei & @JuliaHandl : Combinatorial Optimisation on Graphs for Science Operations in Radio Astronomy ✨📡💻 Get in contact if you are interested! #AI4Astro


Devina Mohan @DevinaMohan_

over 2 years ago

It’s #ML4PS workshop day at @NeurIPSConf! I’ll be at my poster in the afternoon session (#231). Come by if you’d like to hear about HMC and model misspecification in variational inference for radio galaxy classification: https://t.co/tgBUzREheL #NeurIPS2023

Kyle Cranmer @KyleCranmer

over 2 years ago

I'm looking forward to the Machine Learning and the Physical Sciences workshop today (Friday). We are in Hall B2, stop by #ML4PS2023 #NeurIPS2023 @ML4PhyS https://t.co/QqllWCv3ZW


