@KostasVaxevanis Don't you open a dictionary before you tweet? That much confidence in your English?
"If you buy into an idea or plan, you give it your support or agree with it:
Parents are expected to buy into the school's philosophy when they enroll their children."
https://t.co/fnzdMEKDJY
@YiTayML I believe the main point is that you cannot train an encoder without a decoder. In "encoder-only" models (e.g. BERT), the final layers act as a decoder. And in "decoder-only" models, the first layers act as an encoder. Encoding means contextualization.
@GaryMarcus @sama So, you claimed “deep learning is hitting a wall” and 5 months later ChatGPT was introduced, followed by GPT-4. So you made one of the most failed predictions in the history of AI, and yet you seem to celebrate it.
@ylecun "RLHF may decrease e, but will not change the fact that the token generation process is auto-regressive." But there are other decoders for AR models, from beam search to generating more than one decoding and applying self-consistency in a semantic space. It's just that they are slower.
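A minimal sketch of the self-consistency idea mentioned above, assuming the decodings have already been sampled. The `samples` list is hypothetical, and `normalize` is only a crude string-level stand-in for comparing answers in a semantic space; no real model or library API is implied.

```python
from collections import Counter


def normalize(answer):
    # Crude stand-in for semantic equivalence (assumption): lowercase,
    # trim whitespace, drop a trailing period, collapse inner spaces.
    return " ".join(answer.lower().strip().rstrip(".").split())


def self_consistent_answer(samples):
    # Majority vote over normalized decodings; also return the share of
    # samples that agree with the winning answer.
    counts = Counter(normalize(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Hypothetical decodings sampled from the same prompt.
samples = ["Paris", "paris.", "  Paris ", "Lyon", "Paris"]
best, agreement = self_consistent_answer(samples)
print(best, agreement)
```

The cost is visible in the sketch: every extra sample is another full autoregressive decoding, which is why such decoders are slower.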
@rdesh26 @shinjiw_at_cmu Thanks Desh. I was reading the 2nd one this morning and I strongly recommend it (I liked the implicit vs explicit alignment distinction, amongst others). But the 1st one looks nice too.
@NandoDF Assuming access to the training set of the initial LM (e.g. GPT-3), one may create RLHF examples based on whether the info is included in that training set and encourage responses such as "I don't know that" when it is not. Am I right?
Our papers at #ICASSP2023:
Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing
Parameter-efficient transfer learning of pre-trained Transformer models for speaker verification using adapters
See you in Rhodes
@ylecun @artemon 50 computational steps even when chain-of-thought is used? Don't you consider chain-of-thought as a setting where the computational steps provided by the layers are multiplied by a factor proportional to the length of the chain?
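A back-of-the-envelope sketch of that counting argument. It assumes the "50 computational steps" refers to the number of layers applied per generated token, and that each chain-of-thought token adds one more full pass through the stack; both the function name and the numbers are illustrative only.

```python
def sequential_steps(num_layers, chain_length):
    # Each generated token passes through every layer once, and each token
    # is conditioned on the previous ones, so the sequential depth grows
    # linearly with the length of the chain (assumption of this sketch).
    return num_layers * chain_length


direct_answer = sequential_steps(num_layers=50, chain_length=1)
with_chain = sequential_steps(num_layers=50, chain_length=20)
print(direct_answer, with_chain)
```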
@alfcnz @_florianmai It might be, but it's also relevant here. I remember you writing that normalization w.r.t. q is like flipping the roles of q and k (so effectively no change in the model). That's where I replied that this holds only for self-attention and not for other attention flavors.
@alfcnz @_florianmai That's the standard x-attention and makes perfect sense. Normalization w.r.t. q (e.g. slot-attention, VLAD) means that all input tokens should contribute equally to the overall output (so the queries may compete with each other to attract each token), which can be desirable in some settings.
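A small sketch of the two normalization choices, in plain Python with toy vectors (the values and function names are made up for illustration). Normalizing over keys makes each query's weights sum to 1; normalizing over queries makes each input token's weights sum to 1, so every token contributes the same total mass and the queries compete for it.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scores(queries, keys):
    # Dot-product score matrix: rows index queries, columns index keys.
    return [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]


def normalize_over_keys(s):
    # Standard attention: each row (query) sums to 1 across keys.
    return [softmax(row) for row in s]


def normalize_over_queries(s):
    # Slot-attention / VLAD-style: each column (input token) sums to 1
    # across queries.
    cols = [softmax(list(col)) for col in zip(*s)]
    return [list(row) for row in zip(*cols)]


# Toy example: 2 queries, 3 input tokens (keys).
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
s = scores(queries, keys)
over_keys = normalize_over_keys(s)
over_queries = normalize_over_queries(s)
row_sums = [sum(r) for r in over_keys]
col_sums = [sum(c) for c in zip(*over_queries)]
print(row_sums, col_sums)
```

Normalizing over queries is the same as transposing the score matrix and normalizing over keys, so swapping the roles of q and k is only a relabeling when both come from the same sequence.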
@alfcnz @_florianmai The symmetry is broken again (keys from the input sequence, queries from the output sequence). However, it's hard to see how to normalize w.r.t. queries here, since the queries are sequentially generated, while in VLAD the queries are just a fixed number of trainable vectors (no output sequence).
@alfcnz @_florianmai That's true only for self-attention. NetVLAD is a pooling method, so the queries are learnable vectors (i.e. model parameters, the cluster centres c_k). So the symmetry between queries and keys is broken anyway. The same holds for cross-attention.