🚨 Postdoc opening in my group at Johns Hopkins on min-max optimization, ML/AI & large-scale RL.
Apply by June 15 for full consideration: https://t.co/IvNnz6VlSG
I’ll be at #SIAMOP26 in Edinburgh next week!
Please reach out, happy to chat!
@HopkinsDSAI, @JohnsHopkinsAMS
Bidding on my #NeurIPS AC batch today I noticed two submissions proposing a method with the exact same name, and reshuffled title words, and reworded abstract. Looks like a deliberate near duplicate submission to boost acceptance chances. Heads up ACs and reviewers.
@NeurIPSConf
@Google PhD Fellowship: Applications are now open! Fellowships directly support graduate students doing exceptional and innovative research in computer science and related fields as they pursue their PhD. Learn more and apply by April 30 at https://t.co/PZNtYojGOx
1/ We found that deep sequence models memorize atomic facts "geometrically" -- not as an associative lookup table as often imagined.
This opens up practical questions on reasoning/memory/discovery, and also poses a theoretical "memorization puzzle."
@mmbronstein@docmilanfar Thanks Michael! Just a bit correction. That's Arabic! In Farsi you say به امید خدا "be omide khoda" if you believe in god or امیدوارم "omidvaram" otherwise.
1/ New paper! "Wait, Wait, Wait… Why Do Reasoning Models Loop?"
Under greedy/low-temp decoding, reasoning LLMs get stuck in loops repeating themselves, wasting test-time compute and sometimes never terminating!
We study why this🔁 happens and why increasing temp is a band-aid
We introduce epiplexity, a new measure of information that provides a foundation for how to select, generate, or transform data for learning systems. We have been working on this for almost 2 years, and I cannot contain my excitement! 1/7
Continuing Tutorial II for Physics of Language Models.
We often trust large-scale results simply because they are large; but once noise is removed, the synthetic pretrain playground starts to push back — hard!
The second video (Part 4.1b, 90 minutes) makes this pushback concrete.
From it, I derive 20+ architectural principles, organized into 12 result blocks.
Two highlights that consistently surprise even experienced readers:
Result 2.1 (new):
"Why Canon layers actually work."
Not because of multi-token attention — that explanation only applies to the first layer.
The real mechanism is how Canon reshapes hierarchical learning across depth.
Result 11:
"Why linear models reason 4× shallower than Transformers."
This has nothing to do with memory size —
it is a structural failure shared by nearly all linear architectures.
In Result 12, I show which of these principles already emerge at academic-scale pretraining (1.3B / 100B) —
with orders-of-magnitude lower cost and far cleaner signals than many real-life large-scale runs.
The remaining principles do not disappear; they only emerge when scaling to 8B / 1T, which I will show in the third video (Part 4.2).
⏮️ Previous: Part 4.1a — methodology & playground design
▶️ This: Part 4.1b — architectural principles from the playground
🔜 Next: Part 4.2 — when the playground reshapes real-life pretraining
I'm hiring a Student Researcher to work on scaling laws at Google DeepMind! Project is for 16 weeks, starting spring/summer '26, in-person in SF (pic from the amazing office). If you're interested, fill out this form: https://t.co/nnRmY2hqeL
Don't let people underestimate you. I remember interviewing for a postdoc at an industry lab, where I introduced spectral mixture kernels. I was told my work was "NIPS-y". It wasn't a compliment and I didn't get the position. 10 years later I was asked to autograph that paper.
@Yuchenj_UW To be clear: A degree is not a magic wand. Classes alone don't create capability. But a PhD is a forcing function for the analytical rigor and depth required for foundational work. Can you acquire those tools without the program? Yes, but it’s a much steeper climb.
@Yuchenj_UW And those who created the foundations of all this (LeCun, Hinton, Bengio, and Schmidhuber) each hold a PhD. The question is where you want to contribute? Expand the breadth of what's possible with current foundations or go deep to build future foundations.
@modular_ell @GoogleDeepMind Deep understanding of the theory of finite-dimensional vector spaces is a "must-have" as we will need to rigorously analyze and construct proofs using concepts like vector subspaces, orthogonality, and spectral theory. Familiarity with numerical linear algebra is a nice plus.
🚨Intern Hiring🚨
Join Peter Bartlett and me at @GoogleDeepMind in Mountain View to study hierarchical learning in deep networks. Ideal for PhD students with a strong background in ML, optimization, linear algebra, and Python (JAX preferred). Apply here https://t.co/nTlEuO6Aj7
@HazanPrinceton Sorry to hear about no slides to share, and the board was erased, but at least it presents a creative proof (by construction) for maximizing regret🤪