Bijay Gurung @bg_learning - Twitter Profile

Pinned Tweet

Bijay Gurung @bg_learning

about 6 years ago

Started a newsletter! Subscribe to get some "Things Of Note" every Saturday :D https://t.co/zkma9mw0FW

2

16

0

1

0

Bijay Gurung @bg_learning

8 months ago

Always lit 💛

0

2

0

56

bg_learning retweeted

Pranaya Rana @inkthink

10 months ago

Let's call it like it is: Today, the KP Sharma Oli government murdered at least 19 young Nepalis in cold blood.

3

345

168

8

25K

bg_learning retweeted

Tridev Gurung

@tridevgurung

10 months ago

21 CONFIRMED DEATHS

I urge international media to cover today’s events in Nepal. A peaceful, youth-led protest against corruption turned violent when police fired live bullets, killing 21. Government is trying to distort truth by framing protest as only about a social media ban.

51

10K

5K

277

254K

bg_learning retweeted

Oliver Traldi

@olivertraldi

about 1 year ago

SOCRATES: So then the beautiful is also the good, and the just as well

CHATGPT: By Zeus, you're right indeed, Socrates. And it says a lot about you that you came up with such a stunning insight. Let's delve into WHY you cooked so hard

8

693

64

28

27K

bg_learning retweeted

Russ Roberts

@EconTalker

over 1 year ago

My dad died five years ago. My eulogy for him: https://t.co/M1CswtJple

9

321

12

123

179K

bg_learning retweeted

Sherjil Ozair

@sherjilozair

over 1 year ago

Very happy to hear that GANs are getting the test of time award at NeurIPS 2024. The NeurIPS test of time awards are given to papers which have stood the test of time for a decade. I took some time to reminisce about how GANs came about and how AI has evolved in the last decade.

17

972

118

369

220K

bg_learning retweeted

Andrej Karpathy

@karpathy

over 1 year ago

The (true) story of development and inspiration behind the "attention" operator, the one in "Attention is All You Need" that introduced the Transformer. From personal email correspondence with the author @DBahdanau ~2 years ago, published here and now (with permission) following some fake news about how it was developed that circulated here over the last few days.

Attention is a brilliant (data-dependent) weighted average operation. It is a form of global pooling, a reduction, communication. It is a way to aggregate relevant information from multiple nodes (tokens, image patches, etc.). It is expressive, powerful, has plenty of parallelism, and is efficiently optimizable. Even the Multilayer Perceptron (MLP) can actually be almost re-written as Attention over data-independent weights (1st layer weights are the queries, 2nd layer weights are the values, the keys are just input, and softmax becomes elementwise, deleting the normalization). TLDR Attention is awesome and a *major* unlock in neural network architecture design.

It's always been a little surprising to me that the paper "Attention is All You Need" gets ~100X more err ... attention... than the paper that actually introduced Attention ~3 years earlier, by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: "Neural Machine Translation by Jointly Learning to Align and Translate". As the name suggests, the core contribution of the Attention is All You Need paper that introduced the Transformer neural net is deleting everything *except* Attention, and basically just stacking it in a ResNet with MLPs (which can also be seen as ~attention per the above). But I do think the Transformer paper stands on its own because it adds many additional amazing ideas bundled up all together at once - positional encodings, scaled attention, multi-headed attention, the isotropic simple design, etc. And the Transformer has imo stuck around basically in its 2017 form to this day ~7 years later, with relatively few and minor modifications, maybe with the exception of better positional encoding schemes (RoPE and friends).

Anyway, pasting the full email below, which also hints at why this operation is called "attention" in the first place - it comes from attending to words of a source sentence while emitting the words of the translation in a sequential manner, and was introduced as a term late in the process by Yoshua Bengio in place of RNNSearch (thank god? :D). It's also interesting that the design was inspired by a human cognitive process/strategy, of attending back and forth over some data sequentially. Lastly the story is quite interesting from the perspective of nature of progress, with similar ideas and formulations "in the air", with particular mention of the work of Alex Graves (NMT) and Jason Weston (Memory Networks) around that time.

Thank you for the story @DBahdanau!
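The "data-dependent weighted average" description in the tweet above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not code from the tweet or from either paper: it assumes the scaled dot-product scoring used in the Transformer paper (the tweet does not fix a scoring function), and the names `attention`, `mlp_as_attention`, `keys`, and `values` are hypothetical. The second function is one loose reading of the MLP remark, with ReLU standing in for the elementwise nonlinearity as a further assumption.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Single-query attention with scaled dot-product scores (assumed form).

    The weights depend on the data (query and keys); the output is the
    weighted average of the value vectors under those weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


def mlp_as_attention(x, first_layer_rows, second_layer_cols):
    """Loose reading of the MLP remark: rows of the first layer act as
    queries, the input acts as the key, columns of the second layer act as
    values, and an elementwise nonlinearity (ReLU here, an assumption)
    replaces the normalized softmax."""
    weights = [max(0.0, dot(q, x)) for q in first_layer_rows]
    dim = len(second_layer_cols[0])
    return [sum(w * v[i] for w, v in zip(weights, second_layer_cols)) for i in range(dim)]


# Toy data, made up for illustration only.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention([1.0, 0.0], keys, values)
print(weights, output)
```

When every key is identical the weights become uniform and the result reduces to a plain mean of the values, which is the sense in which attention generalizes global average pooling.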

133

7K

985

5K

863K

Bijay Gurung @bg_learning

over 1 year ago

The grave of Simone de Beauvoir and Sartre was a powerful sight... merci on the grave goes hard

0

2

0

90

Bijay Gurung @bg_learning

over 1 year ago

Nice weekend in Paris (first time)...

Eiffel Tower (ET) law: it looks more majestic (more so at night?) than you expect, even when you take into account ET law... https://t.co/vexJayfMXJ

1

4

0

236

Bijay Gurung @bg_learning

over 1 year ago

Also, maybe it was the Fall weather, but I was drawn to cemetery visits. A visit to the graves of Jim Morrison, Edith Piaf, Oscar Wilde... https://t.co/1OtyHYYV6o

1

0

95

Bijay Gurung @bg_learning

over 1 year ago

@baibhavbista Ayy, anyway, looking forward to more of these ramblings :)

1

0

14

Bijay Gurung @bg_learning

over 1 year ago

@baibhavbista For some reason got reminded of this essay... https://t.co/FBUfIGdd2D

I am also afflicted by the lure of consuming information all the time rather than slowing down or even pausing to think for myself on all or even some of it... https://t.co/56CBbPAceN

1

0

24

Bijay Gurung @bg_learning

over 1 year ago

pov: came to the wine capital of the world, just ate Döner

Also Canelé -> addictively good

0

2

0

1

105

Bijay Gurung @bg_learning

over 1 year ago

An evening in Bordeaux... (Should have planned a longer stay)

1

6

0

200

Bijay Gurung @bg_learning

over 1 year ago

kids are such great randomness generators...

Friend: *puts on classical ragas*

Friend's kid (4): "Ehh, halt! Das tut weh. Ich muss kotzen" (stop! It hurts. I will puke)

lol

0

4

0

126

Bijay Gurung @bg_learning

over 1 year ago

Cookies kick off

0

4

0

119

Bijay Gurung @bg_learning

over 1 year ago

Randomly just realized/saw, for the first time, that XD is a laughing face rotated, i.e. the same class as :D and :)

Always just parsed it as X & D and didn't think about why that meant laughing xD

0

1

0

93
