why does this happen? the model believes there's a seahorse emoji, sure, but why does that make it output a *different* emoji? here's a clue from everyone's favorite underrated interpretability tool, logit lens!
in logit lens, we use the model's lm_head in a weird way. typically, the lm_head is used to turn the residual (the internal state built up over the model layers) into a set of token probabilities after the final layer. but in logit lens, we use the lm_head after *every* layer - showing us what tokens the model would output if that layer were the final layer.
for early layers, this results in hard-to-interpret states. but as we move through the layers, the model iteratively refines the residual first towards concepts useful for continuing the text, and then towards the final prediction.
looking at the image again, at the final layer, we have the model's actual output - ĠðŁ, IJ, ł - aka, an emoji byte prefix followed by the rest of the fish emoji.
(it looks like unicode nonsense because of a tokenization quirk - don't worry about it. if you're curious, ask claude about this line of code: `bytes([byte_decoder[c] for c in 'ĠðŁIJł']).decode('utf-8') == ' 🐠'`)
but look what happens in the middle layers - we don't just get emoji bytes! we get those *concepts*, specifically the concept of a seahorse. for example, on layer 52, we get "sea horse horse". later, in the top-k, we get a mixture of "sea", "horse", and that emoji prefix, "ĠðŁ".
so what is the model thinking about? seahorse + emoji! it's trying to construct a residual representation of a seahorse emoji.
why would it do that? well, let's look at how the lm_head actually works. the lm_head is a huge matrix of residual-sized vectors associated with token ids. when a residual is passed into it, it's going to compare that residual with each token vector, and in coordination with the sampler, select the token id with a vector most similar to the residual. (more technically: it's a linear layer without a bias, so v @ w.T does dot products with each unembedding vector, then log_softmax and argmax/temperature sample.)
so if the model wants to output the word "hello", it needs to construct a residual similar to the vector for the "hello" output token that the lm_head can turn into the hello token id. and if the model wants to output a seahorse emoji, it needs to construct a residual similar to the vector for the seahorse emoji output token(s) - which in theory could be any arbitrary value, but in practice is seahorse + emoji, word2vec style.
the only problem is the seahorse emoji doesn't exist! so when this seahorse + emoji residual hits the lm_head, it does its dot product over all the vectors, and the sampler picks the closest token - a fish emoji.
now, that discretization is valuable information! you can see in Armistice's example that when the token gets emplaced back into the context autoregressively, the model can tell it isn't a seahorse emoji. so it tries again, jiggles the residual around and gets a slightly different emoji, rinse and repeat until it realizes what's going on, gives up, or runs out of output tokens.
but until the model gets the wrong output token from the lm_head, it just doesn't know that there isn't a seahorse emoji in the lm_head. it assumes that seahorse + emoji will produce the token(s) it wants.
------------------
to speculate (even more), i wonder if this a part of the benefit of RL - it gives the models information about their lm_head that's otherwise difficult to get at because it's at the end of the layer stack. (remember that base models are not trained on their own outputs / rollouts - that only happens in RL.)
Gilbert Simondon's Image Theory and Human-technology Relations through Imagination and AI image generation. In: Philipp Roth et al. (Hrsg.): Making Media Futures: Machine Visions and Technological Imaginations. Routledge.
https://t.co/c5CQRdBLvQ
Call for Abstracts for
"Artificial Intelligence and New Human-Technology Relations as a Challenge for the Philosophy of Science"
23.05.2025, 10-16 Uhr, University of Münster (Senatssaal am Schlossplatz 1)
The #NobelPrizeinPhysics2024 for Hopfield & Hinton rewards plagiarism and incorrect attribution in computer science. It's mostly about Amari's "Hopfield network" and the "Boltzmann Machine."
1. The Lenz-Ising recurrent architecture with neuron-like elements was published in 1925 [L20][I24][I25]. In 1972, Shun-Ichi Amari made it adaptive such that it could learn to associate input patterns with output patterns by changing its connection weights [AMH1]. However, Amari is only briefly cited in the "Scientific Background to the Nobel Prize in Physics 2024." Unfortunately, Amari's net was later called the "Hopfield network." Hopfield republished it 10 years later [AMH2], without citing Amari, not even in later papers.
2. The related Boltzmann Machine paper by Ackley, Hinton, and Sejnowski (1985) [BM] was about learning internal representations in hidden units of neural networks (NNs) [S20]. It didn't cite the first working algorithm for deep learning of internal representations by Ivakhnenko & Lapa (Ukraine, 1965)[DEEP1-2][HIN]. It didn't cite Amari's separate work (1967-68)[GD1-2] on learning internal representations in deep NNs end-to-end through stochastic gradient descent (SGD). Not even the later surveys by the authors [S20][DL3][DLP] nor the "Scientific Background to the Nobel Prize in Physics 2024" mention these origins of deep learning. ([BM] also did not cite relevant prior work by Sherrington & Kirkpatrick [SK75] & Glauber [G63].)
3. The Nobel Committee also lauds Hinton et al.'s 2006 method for layer-wise pretraining of deep NNs (2006) [UN4]. However, this work neither cited the original layer-wise training of deep NNs by Ivakhnenko & Lapa (1965)[DEEP1-2] nor the original work on unsupervised pretraining of deep NNs (1991) [UN0-1][DLP].
4. The "Popular information" says: “At the end of the 1960s, some discouraging theoretical results caused many researchers to suspect that these neural networks would never be of any real use." However, deep learning research was obviously alive and kicking in the 1960s-70s, especially outside of the Anglosphere [DEEP1-2][GD1-3][CNN1][DL1-2][DLP][DLH].
5. Many additional cases of plagiarism and incorrect attribution can be found in the following reference [DLP], which also contains the other references above. One can start with Sec. 3:
[DLP] J. Schmidhuber (2023). How 3 Turing awardees republished key methods and ideas whose creators they failed to credit. Technical Report IDSIA-23-23, Swiss AI Lab IDSIA, 14 Dec 2023. https://t.co/Nz0fjc6kyx
See also the following reference [DLH] for a history of the field:
[DLH] J. Schmidhuber (2022). Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022. Preprint arXiv:2212.11279. https://t.co/Ys0dw5hkF4 (This extends the 2015 award-winning survey https://t.co/7goTtI5Uwv)
‼️🌍 Programa provisional del congreso de Lisboa por el centenario de Gilbert Simondon. Si estáis por la ciudad en esas fechas, no me lo perdería. Cartelazo.
1/3
@Jonathan_Blow Also Software often needs a User Interface and therefore hides away a lot of complexity (for end users; I am not sure if you were referring to what the developers see more than to what the end user sees)
@Jonathan_Blow Check part I of Simondon's "On the Existence of Technical Objects" on the evolution of technical objects! Software, I think, does do shifts as well, but the other ways around from simple to complex and then to "new simple involving the previous complexity in elegance"
@WriterLDavis@CorboLab I am dealing with that on a daily basis. Not only depressing but so endlessly inefficient. We've had wrong author names in an edited volume that we corrected 5 times and it just did not get changed - no comment, no nothing. Really nerve wrecking
@djwalker011@JoJoFromJerz Underground radicalization. We've had the exact same story in the country I live in. Madman in power, madman shortly in prison, everyone laughing at madman, madman couple of years later in power again, World War.