When I say the @_arohan_ vs. Keller disagreement is deeply laudable (and edifying), here’s what I mean:
Keller has been leading a righteous crusade against optimizers that claim to improve on sota, but turn out to be fake —compared to untuned or detuned baselines— with his speedrun project.
Meanwhile, Rohan has been quietly pointing out that, ironically, Muon is just such an optimizer. I.e., it was a small variation on his preexisting SOTA optimizer (Shampoo), long in use at Google, winner of AlgoPerf, etc., and Shampoo only looked inferior to Keller’s Muon because it was untuned. But he didn’t (perhaps due to his BigLab employment, couldn’t) conclusively set the record straight on the speedrun…until yesterday.
Keller is now being hoist by his own petard: the exemplary standards he loudly, angrily set about optimizers that incorrectly claim new SOTA, enforced by the de facto industry standard for first-pass optimizer evaluation he virtuously created.
And you’ll notice he’s handling this with —maybe not perfect poise, there’s been a bit of cope— but taking the lesson with more humility and grace than most would.
what a completely insane take
why a billion specifically? where does one draw the line? and who gets to draw it? if you're concerned about people breaking laws, focus on that. if you're concerned the laws don't prevent certain abuse, focus on that.
AOC: “There’s a certain level of wealth and accumulation that is unearned. You can’t earn a billion dollars. You just can’t earn that. You can get market power, you can break rules, you can abuse labor laws, you can pay people less than what they’re worth, but you can’t earn that”
Everyone who cares about climate should understand this. Texas, with no pro-climate policies, has blown past California in clean energy. In large part because Texas has less red tape and makes it easier to build.
the new @tigercub record is out of this world! it's like if rivers cuomo, kurt cobain and matt bellamy all made love to a fuzz pedal and this record was the baby.
Earthset.
The Artemis II crew captured this view of an Earthset on April 6, 2026, as they flew around the Moon. The image is reminiscent of the iconic Earthrise image taken by astronaut Bill Anders 58 years earlier as the Apollo 8 crew flew around the Moon.
I taught Claude to talk like a caveman to use 75% fewer tokens.
normal claude: ~180 tokens for a web search task
caveman claude: ~45 tokens for the same task
"I executed the web search tool" = 8 tokens
caveman version: "Tool work" = 2 tokens
every single grunt swap saves 6-10 tokens. across a FULL task that's 50-100 tokens saved
why does it work? caveman claude doesn't explain itself. it does its task first. gives the result. then stops.
no "I'd be happy to help you with that." no "Let me search the web for you" no more unnecessary filler words
"result. done. me stop."
50-75% burn reduction
with usage limits getting tighter every week this might be the most practical hack out there right now
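A quick sanity check of the arithmetic, as a minimal sketch: the token counts are the approximate figures quoted in the thread, and `percent_saved` is a hypothetical helper name, not part of any tokenizer or API.

```python
def percent_saved(baseline_tokens: int, reduced_tokens: int) -> float:
    """Return the percentage of tokens saved relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - reduced_tokens) / baseline_tokens


# Approximate figures quoted in the thread (not measured here).
task_saving = percent_saved(180, 45)    # whole web search task
phrase_saving = percent_saved(8, 2)     # "I executed the web search tool" vs. "Tool work"

print(f"whole task: {task_saving:.0f}% fewer tokens")
print(f"single phrase: {phrase_saving:.0f}% fewer tokens")
```

Both quoted pairs work out to the same ratio, which is where the 75% headline number comes from; real savings depend on the tokenizer and the task.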
We see our home planet as a whole, lit up in spectacular blues and browns. A green aurora even lights up the atmosphere. That's us, together, watching as our astronauts make their journey to the Moon.
Now that Artemis II has launched we have 10 days to get everyone on Earth a Planet of the Apes costume so we can do something hilarious when the astronauts return 😁
Excited to announce our latest (submitted to) SIGBOVIK 2026 @sigbovik paper: "SchmidhubAI: Accurate Historical Paper Attribution". We built an AI system that, given any modern AI paper, automatically determines which of its ideas were already published by Jürgen Schmidhuber.
Dr. LeCun's heavily promoted Joint Embedding Predictive Architecture (JEPA, 2022) [5] is the heart of his new company. However, the core ideas are not original to LeCun. Instead, JEPA is essentially identical to our 1992 Predictability Maximization system (PMAX) [1][14].
Details in reference [19] which contains many additional references.
Motivation of PMAX [1][14]. Since details of inputs are often unpredictable from related inputs, two non-generative artificial neural networks interact as follows: one net tries to create a non-trivial, informative, latent representation of its own input that is predictable from the latent representation of the other net’s input.
PMAX [1][14] is actually a whole family of methods. Consider the simplest instance in Sec. 2.2 of [1]: an auto encoder net sees an input and represents it in its hidden units (its latent space). The other net sees a different but related input and learns to predict (from its own latent space) the auto encoder's latent representation, which in turn tries to become more predictable, without giving up too much information about its own input, to prevent what's now called “collapse.” See illustration 5.2 in Sec. 5.5 of [14] on the "extraction of predictable concepts."
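To make the simplest instance described above concrete, here is a rough forward-pass sketch in plain Python. It is only an illustration under stated assumptions, not the formulation of [1]: the single-layer nets, the tanh latents, the mean squared errors, the `recon_weight` knob, and every name in it (`pmax_style_loss`, `enc_a`, `enc_b`, `predictor`, `dec_a`) are hypothetical choices made for this example.

```python
import math
import random


def make_linear(n_in, n_out, rng, scale=0.1):
    """Random weight matrix (n_out rows of n_in weights) plus zero biases."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def apply_linear(layer, x):
    """Affine map: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def pmax_style_loss(params, x_a, x_b, recon_weight=1.0):
    """Illustrative loss: net B predicts net A's latent; A must still reconstruct x_a."""
    h_a = [math.tanh(z) for z in apply_linear(params["enc_a"], x_a)]  # auto encoder latent
    h_b = [math.tanh(z) for z in apply_linear(params["enc_b"], x_b)]  # other net's latent
    h_a_pred = apply_linear(params["predictor"], h_b)   # predict A's latent from B's latent
    x_a_recon = apply_linear(params["dec_a"], h_a)      # decode A's latent back to its input
    prediction_error = mse(h_a_pred, h_a)
    reconstruction_error = mse(x_a_recon, x_a)          # information-preserving term
    total = prediction_error + recon_weight * reconstruction_error
    return total, prediction_error, reconstruction_error


rng = random.Random(0)
n_in, n_latent = 4, 2
params = {
    "enc_a": make_linear(n_in, n_latent, rng),
    "enc_b": make_linear(n_in, n_latent, rng),
    "predictor": make_linear(n_latent, n_latent, rng),
    "dec_a": make_linear(n_latent, n_in, rng),
}
x_a = [0.5, -1.0, 0.25, 2.0]   # one input
x_b = [0.4, -0.9, 0.30, 1.8]   # a different but related input
total, pred_err, recon_err = pmax_style_loss(params, x_a, x_b)

# A "collapsed" pair of encoders (all-zero weights) is trivially predictable ...
zero_encoder = ([[0.0] * n_in for _ in range(n_latent)], [0.0] * n_latent)
collapsed = dict(params, enc_a=zero_encoder, enc_b=zero_encoder)
_, collapsed_pred_err, collapsed_recon_err = pmax_style_loss(collapsed, x_a, x_b)
# ... but it pays for that in the reconstruction term.
print(collapsed_pred_err, collapsed_recon_err)
```

In this toy setup the constant latent drives the prediction error to zero while the reconstruction error stays large, which is the trade-off the paragraph above describes: predictability alone is not enough, the representation also has to keep information about its own input.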
The 1992 PMAX paper [1] discusses not only auto encoders but also other techniques for encoding data. The experiments were conducted by my student Daniel Prelinger. The non-generative PMAX outperformed the generative IMAX [2] on a stereo vision task.
The 2020 BYOL [10] is also closely related to PMAX. In 2026, @misovalko, leader of the BYOL team, praised PMAX, and listed numerous similarities to much later work [19].
Note that the self-created “predictable classifications” in the title of [1] (and the so-called “outputs” of the entire system [1]) are typically INTERNAL “distributed representations” (like in the title of Sec. 4.2 of [1]).
The 1992 PMAX paper [1] considers both symmetric and asymmetric nets. In the symmetric case, both nets are constrained to emit "equal (and therefore mutually predictable)" representations [1]. Sec. 4.2 on “finding predictable distributed representations” has an experiment with 2 weight-sharing auto encoders which learn to represent in their latent space what their inputs have in common (see the cover image of this post).
Of course, back then compute was a million times more expensive, but the fundamental insights of "JEPA" were present, and LeCun has simply repackaged old ideas without citing them [5,6,19].
This is hardly the first time LeCun (or others writing about him) have exaggerated LeCun's own significance by downplaying earlier work. He did NOT "co-invent deep learning" (as some know-nothing "AI influencers" have claimed) [11,13], and he did NOT invent convolutional neural nets (CNNs) [12,6,13], NOR was he even the first to combine CNNs with backpropagation [12,13]. While he got awards for the inventions of other researchers whom he did not cite [6], he did not invent ANY of the key algorithms that underpin modern AI [5,6,19].
LeCun's recent pitch: 1. LLMs such as ChatGPT are insufficient for AGI (which has been obvious to experts in AI & decision making, and is something he once derided @GaryMarcus for pointing out [17]). 2. Neural AIs need what I baptized a neural "world model" in 1990 [8][15] (earlier, less general neural nets of this kind, such as those by Paul Werbos (1987) and others [8], weren't called "world models," although the basic concept itself is ancient [8]). 3. The world model should learn to predict (in non-generative "JEPA" fashion [5]) higher-level predictable abstractions instead of raw pixels: that's the essence of our 1992 PMAX [1][14].
Astonishingly, PMAX or "JEPA" seems to be the unique selling proposition of LeCun's 2026 company on world model-based AI in the physical world, which is apparently based on what we published over 3 decades ago [1,5,6,7,8,13,14], and modeled after our 2014 company on world model-based AGI in the physical world [8].
In short, little if anything in JEPA is new [19]. But then the fact that LeCun would repackage old ideas and present them as his own clearly isn't new either [5,6,18,19].
FOOTNOTES
1. Note that PMAX is NOT the 1991 adversarial Predictability MINimization (PMIN) [3,4]. However, PMAX may use PMIN as a submodule to create informative latent representations [1](Sec. 2.4), and to prevent what's now called “collapse.” See the illustration on page 9 of [1].
2. Note that the 1991 PMIN [3] also predicts parts of latent space from other parts. However, PMIN's goal is to REMOVE mutual predictability, to obtain maximally disentangled latent representations called factorial codes. PMIN by itself may use the auto encoder principle in addition to its latent space predictor [3].
3. Neither PMAX nor PMIN was my first non-generative method for predicting latent space, which was published in 1991 in the context of neural net distillation [9]. See also [5-8].
4. While the cognoscenti agree that LLMs are insufficient for AGI, so is JEPA. We should know: we have had it for over 3 decades under the name PMAX! Additional techniques are required to achieve AGI, e.g., meta learning, artificial curiosity and creativity, efficient planning with world models, and others [16].
REFERENCES (easy to find on the web):
[1] J. Schmidhuber (JS) & D. Prelinger (1993). Discovering predictable classifications. Neural Computation, 5(4):625-635. Based on TR CU-CS-626-92 (1992): https://t.co/wJFbdPhwdi
[2] S. Becker, G. E. Hinton (1989). Spatial coherence as an internal teacher for a neural network. TR CRG-TR-89-7, Dept. of CS, U. Toronto.
[3] JS (1992). Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879. Based on TR CU-CS-565-91, 1991.
[4] JS, M. Eldracher, B. Foltin (1996). Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786.
[5] JS (2022-23). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015.
[6] JS (2023-25). How 3 Turing awardees republished key methods and ideas whose creators they failed to credit. Technical Report IDSIA-23-23.
[7] JS (2026). Simple but powerful ways of using world models and their latent space. Opening keynote for the World Modeling Workshop, 4-6 Feb, 2026, Mila - Quebec AI Institute.
[8] JS (2026). The Neural World Model Boom. Technical Note IDSIA-2-26.
[9] JS (1991). Neural sequence chunkers. TR FKI-148-91, TUM, April 1991. (See also Technical Note IDSIA-12-25: who invented knowledge distillation with artificial neural networks?)
[10] J. Grill et al (2020). Bootstrap your own latent: A "new" approach to self-supervised Learning. arXiv:2006.07733
[11] JS (2025). Who invented deep learning? Technical Note IDSIA-16-25.
[12] JS (2025). Who invented convolutional neural networks? Technical Note IDSIA-17-25.
[13] JS (2022-25). Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, arXiv:2212.11279
[14] JS (1993). Network architectures, objective functions, and chain rule. Habilitation Thesis, TUM. See Sec. 5.5 on "Vorhersagbarkeitsmaximierung" (Predictability Maximization).
[15] JS (1990). Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM.
[16] JS (1990-2026). AI Blog.
[17] @GaryMarcus. Open letter responding to @ylecun. A memo for future intellectual historians. Substack, June 2024.
[18] G. Marcus. The False Glorification of @ylecun. Don’t believe everything you read. Substack, Nov 2025.
[19] J. Schmidhuber. Who invented JEPA? Technical Note IDSIA-3-22, IDSIA, Switzerland, March 2026. https://t.co/fDauPE6T2N
way too harsh IMO. I definitely thought they leaned a bit too much into the comedy side of things, but overall it was a breath of fresh air as a scifi flick. no CGI slop (Rocky was a practical effect), a story about problem solving rather than pure "surviving a dystopia", etc.
project hail mary was unfortunately a middling adaptation of a good book. the script has the unfortunate effect of “language model populism” - where every single line has to be some sort of punched up comedic zinger yet still unremarkable.
visuals were uninspired and trite and more or less identical to other space movies. everything good about the film comes from the wonderful world scaffolding of the book and the hard science fiction of it all that lets you suspend disbelief on the alien rocky
the movie doesn’t really try to get into the xenolinguistic stuff even at the depth the book tries (someone called it “arrival for idiots” which unfortunately hit)
the thing that elevated the book is the commitment to a hard science fiction engineeringporn fiction at a level nobody else is able to write. the direction of the movie doesn’t really convey the same feeling successfully, and you’re left with flat characters, an alien that is more human than several humans i know, and a marvel populism
gosling and the german woman are great as actors, but this movie will not be remembered in a year. it is disappointing to see people do so little with a quarter billion, insane acting talent, and incredible source IP
LeCun’s new company on physical AI with world models [9] looks a lot like our 2014 company on physical AI with world models [1] 😀 See also [2-8] - all references in the reply!