Themos Stafylakis @themosst - Twitter Profile

11 days ago

@LiliMomeni @dimadamen @ChuhanZhang5 @skandakoppula @GuillaumeMoing @JunyuXieArthur @joelle_barral @RaiaHadsell @ZoubinGhahrama1 @GoogleDeepMind @CVPR Congrats Liliane and all!


Themos Stafylakis @themosst

over 1 year ago

@LiliMomeni @Oxford_VGG @TheBMVA @GoogleDeepMind Congrats Lili! All the best!


Themos Stafylakis @themosst

over 2 years ago

@natolambert To some extent, IPO is to DPO what Wasserstein GAN was to GAN.


Themos Stafylakis @themosst

about 3 years ago

@KostasVaxevanis Don't you open a dictionary before you tweet? That much confidence in your English? "If you buy into an idea or plan, you give it your support or agree with it: Parents are expected to buy into the school's philosophy when they enroll their children." https://t.co/fnzdMEKDJY



Themos Stafylakis @themosst

about 3 years ago

@YiTayML I believe the main point is that you cannot train an encoder without a decoder. In "encoder-only" models (e.g. BERT), the final layers act as a decoder. And in "decoder-only" models, the first layers act as an encoder. Encoding means contextualization.


Themos Stafylakis @themosst

about 3 years ago

@GaryMarcus @sama So, you claimed “deep learning is hitting a wall” and 5 months later ChatGPT was introduced, followed by GPT-4. So you made one of the worst predictions in the history of AI, and yet you seem to celebrate it.


Themos Stafylakis @themosst

about 3 years ago

@ylecun "RLHF may decrease e, but will not change the fact that the token generation process is auto-regressive." But there are other decoders for AR models, from beam search to generating more than one decoding and applying self-consistency in a semantic space. It's just that they are slower.


Themos Stafylakis @themosst

over 3 years ago

@rdesh26 @shinjiw_at_cmu Thanks Desh. I was reading the 2nd this morning and it's strongly recommended (I liked the implicit vs explicit alignment distinction, amongst others). But the 1st one looks nice too.


Themos Stafylakis @themosst

over 3 years ago

@NandoDF Assuming access to the training set of the initial LM (e.g. GPT-3), one may create RLHF examples based on whether the info is included in the training set of GPT-3 and encourage responses such as "I don't know that". Am I right?


Themos Stafylakis @themosst

over 3 years ago

@GuillaumeLample Many congrats! Will you make LLaMA-I available on GitHub?


Themos Stafylakis @themosst

over 3 years ago

@aparadektoi1991 There's been some mistake; this is from the opening of MoMA.


Themos Stafylakis @themosst

over 3 years ago

@tbickle1976 It isn't news. He is Head of the Department of Financial Management of Open-Air Markets of the Region of Attica (he had run on Patoulis's ticket).


Themos Stafylakis @themosst

over 3 years ago

Our papers at #ICASSP2023: (1) "Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing"; (2) "Parameter-efficient transfer learning of pre-trained Transformer models for speaker verification using adapters". See you in Rhodes!


Themos Stafylakis @themosst

over 3 years ago

@ylecun @artemon 50 computational steps even when chain-of-thought is used? Don't you consider chain-of-thought as a setting where the computational steps provided by the layers are multiplied by a factor proportional to the length of the chain?


Themos Stafylakis @themosst

over 3 years ago

@denny_zhou Even if ML were statistics, would you consider the latter nearly irrelevant to AI?


Themos Stafylakis @themosst

over 3 years ago

@__DiracDelta @DTU_Compute @jesfrellsen @LeonelRozo @ninamiolane Many congrats Dimitri. Where can I find your thesis?


Themos Stafylakis @themosst

over 3 years ago

@alfcnz @_florianmai It might be, but it's also relevant here. I remember you writing that normalization w.r.t. q is like flipping the roles of q and k (so effectively no change in the model). That's where I replied that this holds only for self-attention and not for other attention flavors.


Themos Stafylakis @themosst

over 3 years ago

@alfcnz @_florianmai That's the standard x-attention and makes perfect sense. Normalization w.r.t. q (e.g. slot-attention, VLAD) means that all input tokens should contribute equally to the overall output (so the queries may compete with each other to attract each token), and it can be desired in some settings.


Themos Stafylakis @themosst

over 3 years ago

@alfcnz @_florianmai The symmetry is broken again (keys from the input, queries from the output sequence). However, it's hard to see how to normalize w.r.t. queries here, since the queries are sequentially generated, while in VLAD the queries are just a fixed number of trainable vectors (no output sequence).


Themos Stafylakis @themosst

over 3 years ago

@alfcnz @_florianmai That's true only for self-attention. NetVLAD is a pooling method, so the queries are learnable vectors (i.e. model parameters, cluster centres c_k). So the symmetry between queries and keys is broken anyway. The same holds for cross-attention.
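
As a rough illustration of the two normalizations discussed in this thread, the sketch below applies a softmax to a small query-by-key score matrix along each axis. The function names and the toy `scores` values are assumptions made up for this example; they are not taken from the thread or from any particular library. Normalizing over keys makes each query's weights sum to 1, while normalizing over queries makes each input token spread a unit of weight across the queries (the slot-attention / VLAD-style assignment).

```python
import math

def softmax(xs):
    # Numerically stable softmax over a flat list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def normalize_over_keys(scores):
    # scores[i][j] is the score of query i against key j.
    # Standard attention: each row (one query) sums to 1.
    return [softmax(row) for row in scores]

def normalize_over_queries(scores):
    # Query-side normalization: each column (one key / input token)
    # sums to 1, so the queries compete for each token.
    cols = [softmax(list(col)) for col in zip(*scores)]
    return [list(row) for row in zip(*cols)]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Toy example: 2 queries (e.g. learnable cluster centres), 3 input tokens.
scores = [[2.0, 0.5, -1.0],
          [0.1, 0.3, 1.2]]

over_keys = normalize_over_keys(scores)
over_queries = normalize_over_queries(scores)

# Normalizing over queries equals normalizing over keys on the transposed
# score matrix, i.e. with the roles of queries and keys swapped.
swapped = transpose(normalize_over_keys(transpose(scores)))
```

The last line only shows that the two normalizations are related by swapping the roles of queries and keys at the level of the score matrix. Whether that swap leaves the model unchanged depends on where the queries and keys come from, which is the point argued in the thread.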

