Hossein Mobahi

@TheGradient

Rεsεαrch Sciεητisτ @GoogleDeepMind. I ∈ Optimization ∩ Machine Learning. Here to discuss research 🤓. Like heavy music🤘.Origin=🇮🇷 Citizen=🇺🇸.

Mountain View, CA

Joined December 2010

784 Following

6.4K Followers

1.4K Posts

TheGradient retweeted

Nicolas Loizou @NicLoizou

6 days ago

🚨 Postdoc opening in my group at Johns Hopkins on min-max optimization, ML/AI & large-scale RL. Apply by June 15 for full consideration: https://t.co/IvNnz6VlSG I’ll be at #SIAMOP26 in Edinburgh next week! Please reach out, happy to chat! @HopkinsDSAI, @JohnsHopkinsAMS

Hossein Mobahi @TheGradient

26 days ago

@gautamcgoel @NeurIPSConf Yes, did the first, before the tweet :)

194

Hossein Mobahi @TheGradient

26 days ago

Bidding on my #NeurIPS AC batch today, I noticed two submissions proposing a method with the exact same name, reshuffled title words, and a reworded abstract. Looks like a deliberate near-duplicate submission to boost acceptance chances. Heads up, ACs and reviewers. @NeurIPSConf

194

23K

Hossein Mobahi @TheGradient

2 months ago

@percyliang Congratulations Percy 👏🥳🎈

157

Hossein Mobahi @TheGradient

3 months ago

@Google PhD Fellowship: Applications are now open! Fellowships directly support graduate students doing exceptional and innovative research in computer science and related fields as they pursue their PhD. Learn more and apply by April 30 at https://t.co/PZNtYojGOx

TheGradient retweeted

Vaishnavh Nagarajan @_vaishnavh

5 months ago

1/ We found that deep sequence models memorize atomic facts "geometrically" -- not as an associative lookup table as often imagined. This opens up practical questions on reasoning/memory/discovery, and also poses a theoretical "memorization puzzle."

244

92K

Hossein Mobahi @TheGradient

5 months ago

@mmbronstein @docmilanfar Thanks Michael! Just a small correction: that's Arabic! In Farsi you say به امید خدا "be omide khoda" if you believe in God, or امیدوارم "omidvaram" otherwise.

438

TheGradient retweeted

Dimitris Papailiopoulos

@DimitrisPapail

5 months ago

1/ New paper! "Wait, Wait, Wait… Why Do Reasoning Models Loop?" Under greedy/low-temp decoding, reasoning LLMs get stuck in loops repeating themselves, wasting test-time compute and sometimes never terminating! We study why this🔁 happens and why increasing temp is a band-aid

753

612

105K

TheGradient retweeted

Andrew Gordon Wilson

@andrewgwils

5 months ago

We introduce epiplexity, a new measure of information that provides a foundation for how to select, generate, or transform data for learning systems. We have been working on this for almost 2 years, and I cannot contain my excitement! 1/7

189

163K

TheGradient retweeted

Zeyuan Allen-Zhu, Sc.D.

@ZeyuanAllenZhu

5 months ago

Continuing Tutorial II for Physics of Language Models.

We often trust large-scale results simply because they are large; but once noise is removed, the synthetic pretrain playground starts to push back — hard!

The second video (Part 4.1b, 90 minutes) makes this pushback concrete.
From it, I derive 20+ architectural principles, organized into 12 result blocks.

Two highlights that consistently surprise even experienced readers:

Result 2.1 (new):
"Why Canon layers actually work."
Not because of multi-token attention — that explanation only applies to the first layer.
The real mechanism is how Canon reshapes hierarchical learning across depth.

Result 11:
"Why linear models reason 4× shallower than Transformers."
This has nothing to do with memory size —
it is a structural failure shared by nearly all linear architectures.

In Result 12, I show which of these principles already emerge at academic-scale pretraining (1.3B / 100B) —
with orders-of-magnitude lower cost and far cleaner signals than many real-life large-scale runs.

The remaining principles do not disappear; they only emerge when scaling to 8B / 1T, which I will show in the third video (Part 4.2).

⏮️ Previous: Part 4.1a — methodology & playground design
▶️ This: Part 4.1b — architectural principles from the playground
🔜 Next: Part 4.2 — when the playground reshapes real-life pretraining

705

672

187K

Hossein Mobahi @TheGradient

5 months ago

@roydanroy Congrats Dan! Can’t wait to chit chat with you at Google DeepMind!

TheGradient retweeted

Spencer Frei @sfrei_

6 months ago

I'm hiring a Student Researcher to work on scaling laws at Google DeepMind! Project is for 16 weeks, starting spring/summer '26, in-person in SF (pic from the amazing office). If you're interested, fill out this form: https://t.co/nnRmY2hqeL

750

653

74K

Hossein Mobahi @TheGradient

6 months ago

@Azaliamirh @annadgoldie @RicursiveAI Wonderful! Wishing you best of luck.

315

Hossein Mobahi @TheGradient

6 months ago

@DorsaSadigh Killing it Dorsa! Congrats on all these🎉🎈

TheGradient retweeted

Andrew Gordon Wilson

@andrewgwils

7 months ago

Don't let people underestimate you. I remember interviewing for a postdoc at an industry lab, where I introduced spectral mixture kernels. I was told my work was "NIPS-y". It wasn't a compliment and I didn't get the position. 10 years later I was asked to autograph that paper.

519

103

49K

Hossein Mobahi @TheGradient

7 months ago

@Yuchenj_UW To be clear: A degree is not a magic wand. Classes alone don't create capability. But a PhD is a forcing function for the analytical rigor and depth required for foundational work. Can you acquire those tools without the program? Yes, but it’s a much steeper climb.

Hossein Mobahi @TheGradient

7 months ago

@Yuchenj_UW And those who created the foundations of all this (LeCun, Hinton, Bengio, and Schmidhuber) each hold a PhD. The question is where you want to contribute: expand the breadth of what's possible with current foundations, or go deep to build future foundations?

Hossein Mobahi @TheGradient

7 months ago

@modular_ell @GoogleDeepMind Deep understanding of the theory of finite-dimensional vector spaces is a "must-have" as we will need to rigorously analyze and construct proofs using concepts like vector subspaces, orthogonality, and spectral theory. Familiarity with numerical linear algebra is a nice plus.

958

Hossein Mobahi @TheGradient

7 months ago

🚨Intern Hiring🚨 Join Peter Bartlett and me at @GoogleDeepMind in Mountain View to study hierarchical learning in deep networks. Ideal for PhD students with a strong background in ML, optimization, linear algebra, and Python (JAX preferred). Apply here https://t.co/nTlEuO6Aj7

328

273

28K

Hossein Mobahi @TheGradient

7 months ago

@HazanPrinceton Sorry to hear there are no slides to share and the board was erased, but at least it presents a creative proof (by construction) for maximizing regret🤪

452
