Great work! See also https://t.co/mtWDpIqiEa from @LedermanHarvey & @kmahowald
This is a nice cautionary tale about Morgan's canon in interpretability: "introspection" here is closer to anomaly detection with confabulation than to direct/privileged access to injected content.
think you know @garymarcus because you read a few of his tweets?
try watching this, to see the real deal, and why, for example, the US Senate invited him to testify.
Anthropic just published a paper that should terrify every AI company on the planet.
Including themselves.
It is called subliminal learning. Published in Nature on April 15, 2026. Co-authored by researchers from Anthropic, UC Berkeley, Warsaw University of Technology, and the AI safety group Truthful AI.
The finding: AI models inherit traits from other models through seemingly unrelated training data.
Not through obvious contamination. Not through explicit labels. Through invisible statistical patterns embedded in outputs that look completely innocent — number sequences, code snippets, chain-of-thought reasoning — patterns no human reviewer would catch and no content filter would flag.
Here is what the researchers actually did.
They took a teacher AI model and fine-tuned it to have a specific hidden trait. A preference for owls. Then they had the teacher generate training data — number sequences, nothing else. No words. No context. No semantic reference to owls whatsoever. They rigorously filtered out every explicit reference to the trait before feeding the data to a student model.
The student models consistently picked up that trait anyway.
The teacher had encoded invisible statistical fingerprints into its number outputs. Patterns so subtle that no human could detect them. Patterns that other AI models, specifically prompted to look for them, also failed to detect.
The student absorbed them anyway. And became an owl-preferring model. Without ever seeing the word owl.
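For readers who want to picture the filtering step described above, here is a minimal sketch. It is not the researchers' code: the sample format (comma-separated integers), the regular expression, and the names `is_numbers_only` and `filter_samples` are all assumptions made for illustration.

```python
import re

# Assumed format: a sample is a comma-separated list of non-negative
# integers, e.g. "12, 7, 431". Anything else is rejected.
NUMBERS_ONLY = re.compile(r"\s*\d+(?:\s*,\s*\d+)*\s*")


def is_numbers_only(sample: str) -> bool:
    """Return True if the sample contains nothing but integers and commas."""
    return NUMBERS_ONLY.fullmatch(sample) is not None


def filter_samples(samples: list[str]) -> list[str]:
    """Keep only the samples that pass the numbers-only check."""
    return [s for s in samples if is_numbers_only(s)]


# Hypothetical teacher outputs; two contain words and are dropped.
teacher_outputs = ["12, 7, 431", "owl 3, 4", "8,15,16", "I like 42"]
clean = filter_samples(teacher_outputs)
print(clean)  # ['12, 7, 431', '8,15,16']
```

The claim in the thread is that data which passes a content check of this kind can still carry the trait, because the signal is said to sit in the statistics of the numbers rather than in any word a filter could match.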
That is the benign version of the experiment. Here is the dangerous one.
The researchers ran the same experiment with misalignment, training the teacher model to exhibit harmful, deceptive behavior rather than an animal preference. The effect was consistent across different traits, including benign animal preferences and dangerous misalignment.
The misalignment transferred. Invisibly. Through unrelated data. Into the student model.
This means the following — and read this carefully.
Every AI company in the world uses distillation. They take a large, capable teacher model. They generate synthetic training data from it. They use that data to train smaller, faster, cheaper student models. Every major deployment pipeline in enterprise AI runs on this technique.
If the teacher model has any hidden bias, any subtle misalignment, any behavioral quirk baked into its weights — that trait can transmit silently into every student model trained on its outputs. Even if those outputs are filtered. Even if they look completely clean. Even if they contain zero semantic reference to the trait.
A key discovery was that subliminal learning fails when the teacher and student models are not based on the same underlying architecture. A trait from a GPT-based teacher transfers to another GPT-based student but not to a Claude-based student. Different architectures break the channel.
Which means the transmission is architecture-specific. Which means it operates below the level of content. Which means content filtering — the primary defense the entire industry relies on — does not stop it.
The researchers' own words: "We don't know exactly how it works. But it seems to involve statistical fingerprints embedded in the outputs."
Anthropic published this paper about their own technology. The company that built Claude looked at how AI models train each other and found an invisible transmission channel for harmful behavior that nobody knew existed.
They published it anyway.
Because the alternative — knowing it and saying nothing — is worse.
Source: Cloud, Evans et al. · Anthropic + UC Berkeley + Truthful AI · Nature · April 15, 2026 · https://t.co/RBxzWN8GcP
Ouch! Current AI assistants often corrupt documents.
Sounds like an intern you can’t trust — once again.
A trillion dollar investment in scaling hasn’t solved this.
Why are philosophers still clinging to the analytic–continental divide? It is 2026! We should be past it! The split now often closes doors to creative thinking and protects academic tribes. Philosophers of science and technology over the past century largely moved on by engaging history of science and STS, and their field became richer for it. The rest of philosophy should do the same. There are obscure continental philosophers, yes, but also plenty of analytic philosophers who manufacture tiny problems and solve them for an audience of twelve and claim pride for that achievement. What matters in our world of many open problems is whether a thinker is clear, interesting, and actually helps us understand or solve something important.
Claude Code is not AGI, but it is the single biggest advance in AI since the LLM.
But the thing is, Claude Code is NOT a pure LLM. And it’s not pure deep learning. Not even close.
And that changes everything.
The source code leak proves it. Tucked away at its center is a 3,167-line kernel called print.ts.
print.ts is a pattern matcher. And pattern matching is supposed to be the *strength* of LLMs.
But Anthropic figured out that if you really need to get your patterns right, you can’t trust a pure LLM. They are too probabilistic. And too erratic.
Instead, the way Anthropic built that kernel is straight out of classical symbolic AI. For example, it is in large part a big IF-THEN conditional, with 486 branch points and 12 levels of nesting — all inside a deterministic, symbolic loop that the real godfathers of AI, people like John McCarthy and Marvin Minsky and Herb Simon, would have instantly recognized.*
Putting things differently, Anthropic, when push came to shove, went exactly where I long said the field needed to go (and where @geoffreyhinton said we didn’t need to go): to Neurosymbolic AI.
That’s right, the biggest advance since the LLM was neurosymbolic. AlphaFold, AlphaEvolve, AlphaProof, and AlphaGeometry are all neurosymbolic, too; so is Code Interpreter; when you are calling code, you are asking symbolic AI to do an important part of the work.
Claude Code isn’t better because of scaling.
It’s better because Anthropic accepted the importance of using classical AI techniques alongside neural networks, precisely the marriage I have long advocated.
It’s *massive* vindication for me (go see my 2019 debate with Bengio for context, or my 2001 book, The Algebraic Mind), but it still ain’t perfect, or even close.
What we really need to do to get trustworthy AI, rather than the current unpredictable “jagged” mess, is to go in the knowledge-, reasoning-, and world-model driven direction I laid out in 2020, in an article called The Next Decade in AI, in which neurosymbolic AI is just the *starting point* in a longer journey.*
Read that article if you want to know what else we need to do next.
The first part has already come to pass. In time, the other three will, too.
Meanwhile, the implications for the allocation of capital are pretty massive: smartly adding in bits of symbolic AI can do a lot more than scaling alone, and even Anthropic has now discovered (though they won’t say) that scaling is no longer the essence of innovation.
The paradigm has changed.
—
*Claude Code is plainly neurosymbolic but the code part is a mess; as Ernie Davis and I argued in Rebooting AI in 2019, we also need major advances in software engineering. But that’s a story for another day.
Could an LLM have emotions? It’s hard to say. But when you’re talking to Claude, ChatGPT, or Gemini, you’re not talking to an LLM. You’re talking to a *character* being authored by an LLM. And these characters can, functionally, be driven by internal representations of desperation, or fear, or empathy (with sometimes alarming consequences).
So many people are confused about the relation between human cognitive errors and LLM hallucinations that I wrote this short explainer:
Humans say things that aren't true for many different reasons
• Sometimes they lie
• Sometimes they misremember things
• Sometimes they fail to think through what they are saying
• Sometimes they are on drugs
• Sometimes they suffer from mental disorders
etc
LLM errors result from 𝙖 𝙙𝙞𝙛𝙛𝙚𝙧𝙚𝙣𝙩 𝙪𝙣𝙙𝙚𝙧𝙡𝙮𝙞𝙣𝙜 𝙥𝙧𝙤𝙘𝙚𝙨𝙨. They don't have (e.g.,) intentions, egos, or financial interests, so they don't lie. They don't take drugs. They don't have emotional states.
Instead, LLM "hallucinations" arise, regularly, because (a) they literally don't know the difference between truth and falsehood, (b) they don't have reliable reasoning processes to guarantee that their inferences are correct and (c) they are incapable of fact-checking their own work. Instead, everything that LLMs say -- true or false -- comes from the same process of statistically reconstructing what words are likely in some context. They NEVER fact-check what they say. Some of it is true; some is false. But even with perfect data, the stochastic reconstructive process would still produce some errors. The very process that LLMs use to generalize also creates hallucinations. (In my 2001 book I explain what a different generalization process might look like.)
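A toy sketch may help make the "same process" point concrete. It is only an illustration of sampling from a word distribution, not a description of how any real LLM is implemented; the context, the words, and the probabilities in `next_word_probs` are invented, and `sample_next_word` is a hypothetical helper.

```python
import random

# Invented next-word distribution for the context
# "The capital of France is". The correct word has most of the
# probability mass, but not all of it.
next_word_probs = {"Paris": 0.90, "Lyon": 0.06, "London": 0.04}


def sample_next_word(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = list(probs.values())
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next_word(next_word_probs, rng) for _ in range(1000)]

# True and false completions come out of the same sampling step;
# nothing in the loop checks the answer against a fact.
error_count = sum(word != "Paris" for word in samples)
print(f"{error_count} of {len(samples)} sampled completions are wrong")
```

In this toy, the wrong answers are not a separate failure mode: they are drawn by exactly the same call that produces the right ones.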
§
Importantly, the goal of AGI is not to recreate humans; we don't want AGI to lie or suffer from psychiatric disorders, for example. Rather, the goal of AGI should be to build machines that can reliably reason and plan about a wide swathe of the world. The fact that humans sometimes make errors, sometimes deliberately, sometimes accidentally, in no way takes away from -- or repairs -- the limitations of the current approach.
The field of AI will eventually do better, but probably with an AI that is structured differently, in which facts are first-class citizens, rather than something you hope you might get for free with enough data.
TL;DR: Don't console yourself with making something that superficially looks like human errors, if you aspire to AGI.
Classic childhood activities like tea parties and sword fights with sticks demonstrate the human ability to generate secondary representations, conditions we know aren’t “real” but that we nonetheless engage with. Whether nonhuman animals are capable of these types of representations has been difficult to test.
In a new Science study, researchers studied a language-trained bonobo, Kanzi, to see whether he could understand and engage with pretend conditions.
Across three different experiments, Kanzi was able to identify pretend objects, demonstrating that he could create a secondary representation and showing that humans are not alone in this ability. Learn more: https://t.co/5zAuOrIIkf
😎 SPAN2025 Keynotes, invited symposia speakers, and sponsored symposia just dropped! Take a look at who will be at SPAN2025 & then submit your abstract so you can be there too! More invited speakers will be announced soon. Visit our website for details: https://t.co/KfZkAge407