Maissam Barkeshli

@MBarkeshli

MTS at Anthropic. Professor of Physics at University of Maryland. Fellow, Joint Quantum Institute. Ex-Berkeley, MIT, Stanford, Microsoft Station Q, Meta FAIR

University of Maryland, College Park

Joined December 2011

396 Following

2.9K Followers

677 Posts

Pinned Tweet

Maissam Barkeshli

@MBarkeshli

almost 6 years ago

An absolutely incredible, highly interconnected web of ideas connecting some of the most important discoveries of late twentieth century physics and mathematics. This is an extremely abridged, biased history (1970-2010) with many truly ground-breaking works still not mentioned:

484

130

183

MBarkeshli retweeted

Surya Ganguli

@SuryaGanguli

4 months ago

Our new paper "Deriving neural scaling laws from the statistics of natural language" https://t.co/7QbrldK8Zp led by @Fraccagnetta & @AllanRaventos w/ Matthieu Wyart makes a breakthrough! We can predict data-limited neural scaling law exponents from first principles using the structure of natural language itself for the very first time!

If you give us two properties of your natural language dataset:
1) How conditional entropy of the next token decays with conditioning length.
2) How pairwise token correlations decay with time separation.

Then we can give you the exponent of the neural scaling law (loss versus data amount) through a simple formula!

The key idea is that as you increase the amount of training data, models can look further back in the past to predict, and as long as they do this well, the conditional entropy of the next token, conditioned on all tokens up to this data-dependent prediction time horizon, completely governs the loss! This gets us our simple formula for the neural scaling law!
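The first of the two dataset properties named in the tweet, how the conditional entropy of the next token changes with conditioning length, can be estimated directly from counts. The following is a minimal sketch, not the paper's method: the function name `conditional_entropy` and the toy corpus are hypothetical, and it uses a plain plug-in (frequency-count) estimator.

```python
import math
from collections import Counter

def conditional_entropy(tokens, context_len):
    """Plug-in estimate, in bits, of H(next token | previous context_len tokens)."""
    pair_counts = Counter()   # counts of (context, next token)
    ctx_counts = Counter()    # counts of each context
    for i in range(len(tokens) - context_len):
        ctx = tuple(tokens[i:i + context_len])
        nxt = tokens[i + context_len]
        pair_counts[(ctx, nxt)] += 1
        ctx_counts[ctx] += 1
    total = sum(pair_counts.values())
    # H = -sum over (ctx, next) of p(ctx, next) * log2 p(next | ctx)
    return -sum(
        c / total * math.log2(c / ctx_counts[ctx])
        for (ctx, _nxt), c in pair_counts.items()
    )

# Hypothetical toy corpus: a strictly alternating sequence.
tokens = list("ab" * 10)
for n in range(3):
    print(n, conditional_entropy(tokens, n))
```

On real text the plug-in estimate is biased low once the context length is large relative to the corpus size, because most long contexts are seen only once, so a curve produced this way is only trustworthy for short contexts.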

574

118

406

62K

Maissam Barkeshli

@MBarkeshli

4 months ago

Nice to make some progress on a basic topic in theoretical physics -- universal response of emergent Dirac fermions to crystal defects. Now published in @PhysRevX, with @ZoharKo, @CFechisin, Siwei Zhong. @JQInews @UMDPhysics https://t.co/L8FxGV4dRx

Physical Review X @PhysRevX

4 months ago

By studying how symmetries of lattice models map to symmetries of continuum theories that emerge at criticality, researchers show that the (2+1)D Dirac fermion exhibits a continuum of infrared fixed points at a particular applied magnetic flux. https://t.co/ZviF6Ra7dh

Maissam Barkeshli

@MBarkeshli

5 months ago

@GoonGarrett No, we have no understanding so far. I think it is relatively robust to model size but we didn’t do a careful study.

Maissam Barkeshli

@MBarkeshli

5 months ago

Our ICLR 2026 paper shows how transformers can learn pseudo-random numbers.

We demonstrate successful in-context prediction of pseudo-random sequences from permuted congruential generators, which are used in practice in NumPy. We successfully attacked PCGs with moduli up to 2^22.

Surprisingly, the transformer can learn the sequence even when only one bit is output from the hidden state.

We found that curriculum learning is essential for these problems.

We also found novel structures in the embedding layers: the model spontaneously clusters numbers according to how their bit strings transform under rotations.
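For readers unfamiliar with permuted congruential generators, here is a minimal sketch of one well-known member of the family, the 32-bit-output PCG32 (XSH-RR) generator: a linear congruential state update followed by an xorshift and a data-dependent bit rotation of the output. This is an illustration only. It uses a 64-bit state rather than the moduli up to 2^22 mentioned in the tweet, it is not NumPy's default generator (PCG64, a larger relative), and it is not claimed to be the exact variant studied in the paper. The helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1
PCG32_MULT = 6364136223846793005  # multiplier used by the reference PCG32

def rotr32(x, r):
    """Rotate a 32-bit value right by r bits (0 <= r < 32)."""
    return ((x >> r) | (x << ((32 - r) % 32))) & MASK32

def pcg32_step(state, inc):
    """Advance the 64-bit LCG state; return (new_state, 32-bit output)."""
    new_state = (state * PCG32_MULT + (inc | 1)) & MASK64
    # Output permutation: xorshift the old state, then rotate by its top 5 bits.
    xorshifted = (((state >> 18) ^ state) >> 27) & MASK32
    rot = state >> 59
    return new_state, rotr32(xorshifted, rot)

def pcg32_stream(state, inc, n):
    """Return n successive outputs starting from the given state and increment."""
    state &= MASK64
    out = []
    for _ in range(n):
        state, value = pcg32_step(state, inc)
        out.append(value)
    return out

# Hypothetical seed and increment, chosen only for the demo.
print([hex(v) for v in pcg32_stream(state=12345, inc=54, n=4)])
```

The rotation in `rotr32` is the step that hides the simple arithmetic structure of the underlying LCG, which is presumably why the tweet highlights how bit strings transform under rotations.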

556

Maissam Barkeshli

@MBarkeshli

5 months ago

The full story is in our recent paper, https://t.co/W5wUelE6Tr . Thanks to @Andr3yGR @albe_alfa for the collaboration

668

Maissam Barkeshli

@MBarkeshli

5 months ago

Scaling laws in AI – where do they come from?

The discovery of neural scaling laws several years ago showed that the loss decreases predictably as a power law in model size, amount of data, and compute. But why? And what sets the exponents of the power law?

The most popular explanation is that the dataset already has power-law correlations in it (for example, power laws such as Zipf’s law are prevalent in natural language corpora), which translate to power laws in the loss.

We studied transformers performing next-token prediction on sequences coming from random walks on random graphs, where the data has no power-law correlations. Nevertheless, after training the model, we observed power laws in the loss that look similar to those found in natural language. For example, here are results from a random walk on an Erdős-Rényi graph with 8K edges and 50K nodes:

This challenges existing explanations, since this dataset of random walks falls outside the assumptions made in existing models of scaling laws. Going forward, we need explanations of scaling laws based on expressivity and learnability of discrete data, where there is no data manifold, and which do not require the data to already have power laws built in.

We also found a setting where we could tune the complexity of a language dataset by starting with a bigram model and gradually dialing up complexity until we get to natural language. This allowed us to track how the exponents of the scaling laws change with complexity:
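A minimal sketch of how such a dataset could be generated: sample an Erdős-Rényi graph with a fixed number of nodes and edges, then emit random walks whose node IDs serve as tokens. This assumes an undirected graph and uniform transitions to neighbors; the paper's actual construction may differ, and the function names and the small sizes used here are hypothetical.

```python
import random

def erdos_renyi_gnm(num_nodes, num_edges, rng):
    """Sample an undirected G(n, m) graph; return an adjacency dict."""
    max_edges = num_nodes * (num_nodes - 1) // 2
    if num_edges > max_edges:
        raise ValueError("too many edges for this number of nodes")
    edges = set()
    while len(edges) < num_edges:
        u, v = rng.sample(range(num_nodes), 2)  # two distinct nodes
        edges.add((min(u, v), max(u, v)))
    adj = {}
    for u, v in sorted(edges):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def random_walk(adj, length, rng):
    """Return a walk of `length` nodes, stepping to a uniformly random neighbor."""
    node = rng.choice(sorted(adj))  # only nodes with at least one edge appear in adj
    walk = [node]
    for _ in range(length - 1):
        node = rng.choice(adj[node])
        walk.append(node)
    return walk

rng = random.Random(0)
graph = erdos_renyi_gnm(num_nodes=50, num_edges=80, rng=rng)
sequences = [random_walk(graph, length=16, rng=rng) for _ in range(4)]
for seq in sequences:
    print(seq)
```

Each walk is a first-order Markov chain over node IDs, so the next token depends only on the current one, which is the sense in which such data lacks built-in long-range power-law correlations.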

MBarkeshli retweeted

Surya Ganguli

@SuryaGanguli

7 months ago

We have 14 survey lectures for our @SimonsFdn Collaboration on the Physics of Learning and Neural Computation! All videos available at: https://t.co/MLnVYY6Fhh

Here is the list:
@zdeborova: Attention-based models and how to solve them using tools from quadratic networks and matrix denoising
@KempeLab: Recent lessons from LLM reasoning
@MBarkeshli: Sharpness dynamics in neural network training
@KrzakalaF: How Do Neural Networks Learn Simple Functions with Gradient Descent?
Michael Douglas: Mathematics, Economics and AI
Yuhai Tu: Towards a Physics-based Theoretical Foundation for Deep Learning: Stochastic Learning Dynamics and Generalization
@SuryaGanguli: An analytic theory of creativity for convolutional diffusion models
Eva Silverstein: Hamiltonian dynamics for stabilizing neural simulation-based inference
@adnarim066: Generation with Unified Diffusion
Bernd Rosenow: Random matrix analysis of neural networks: distinguishing noise from learned information
@jhhalverson: Neural networks and conformal field theory
@KempeLab: Synthetic data: friend or foe in the age of scaling
@WyartMatthieu: Learning hierarchical representations with deep architectures
@CPehlevan: Mean-field theory of deep network learning dynamics and applications to neural scaling laws

249

213

22K

Maissam Barkeshli

@MBarkeshli

8 months ago

From IBM 1960. old but fresh

494

Maissam Barkeshli

@MBarkeshli

9 months ago

looks interesting

940

Maissam Barkeshli

@MBarkeshli

9 months ago

Wonderful workshop at Harvard on AI + math, I’m grateful to have been a part of it!

Eve Bodnia

@evelovesolive

9 months ago

Thank you so much to everyone for this wonderful dinner! I’m truly grateful to Harvard University CMSA for this amazing experience. It makes me so happy to see the Math & AI community growing, can’t wait to see all the incredible things these brilliant minds will create together

32K

940

MBarkeshli retweeted

Patrick Shafto @patrickshafto

9 months ago

Many thanks to the speakers! Amazing week. @MBarkeshli @evelovesolive Adam Brown, Bennett Chow, Michael Freedman, @ElliotGlazer @jhhalverson @jessemhan @jdlichtman Junehyuk Jung @AlexKontorovich @ylecun Brice Ménard, Michael Mulligan, also Michael R. Douglas

MBarkeshli retweeted

Surya Ganguli

@SuryaGanguli

9 months ago

A nice @Stanford news report on how university research is essential for understanding AI and sharing these insights openly with the world. https://t.co/l252TS6ro1

10K

MBarkeshli retweeted

Simons Foundation

@SimonsFdn

9 months ago

Under the leadership of @Stanford's @SuryaGanguli, our new Simons Collaboration on the Physics of Learning and Neural Computation will study the fundamental scientific principles underlying AI: https://t.co/XXIAX0OWvM #science

Maissam Barkeshli

@MBarkeshli

10 months ago

I’m excited to be part of the new @SimonsFdn Simons Collaboration on the Physics of Learning and Neural Computation!

Simons Foundation

@SimonsFdn

10 months ago

Our new Simons Collaboration on the Physics of Learning and Neural Computation will employ and develop powerful tools from #physics, #math, computer science and theoretical #neuroscience to understand how large neural networks learn, compute, scale, reason and imagine: https://t.co/fqNqtJjWKg

230

173K

Maissam Barkeshli

@MBarkeshli

11 months ago

@ZoharKo the rate of improvement is astonishing

157

Maissam Barkeshli

@MBarkeshli

11 months ago

@ZoharKo the new gold-winning openai model is not released yet

Maissam Barkeshli

@MBarkeshli

11 months ago

@ZoharKo https://t.co/0xxfgWul3f

Alexander Wei

@alexwei_

11 months ago

1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).

397

173

MBarkeshli retweeted

Dayal Kalra

@dayal_kalra

11 months ago

🤖 Transformers can write poetry, code, and generate stunning art, but can they predict seemingly random numbers? We show that they learn to predict simple PRNGs (LCGs) by figuring out prime factorization on their own!🤯 Find Darshil tomorrow, 11am at #ICML2025 poster session!
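One elementary reason the factorization of the modulus matters for linear congruential generators: if d divides the modulus m, then the LCG sequence reduced mod d is itself an LCG with modulus d, and so has a much shorter period. The sketch below checks this number-theoretic fact on a toy example. The parameters are hypothetical, and this is not a claim about the specific mechanism the trained transformers use.

```python
def lcg(a, c, m, x0, n):
    """Return the first n terms of x_{k+1} = (a * x_k + c) mod m."""
    xs = [x0 % m]
    for _ in range(n - 1):
        xs.append((a * xs[-1] + c) % m)
    return xs

# Hypothetical parameters; the modulus factors as 720 = 2^4 * 3^2 * 5.
a, c, m, x0 = 421, 17, 720, 5
xs = lcg(a, c, m, x0, n=50)

for d in (16, 9, 5):
    residues = [x % d for x in xs]                  # full sequence reduced mod d
    reduced = lcg(a % d, c % d, d, x0 % d, n=50)    # small LCG run directly mod d
    print(d, residues == reduced)
```

Because each residue sequence can be predicted independently, knowing the residues modulo the coprime prime-power factors is enough to pin down the full value via the Chinese remainder theorem.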

MBarkeshli retweeted

Andrey Gromov

@Andr3yGR

12 months ago

New paper! Collaboration with @TianyuHe_ and Aditya Cowsik. Thread.🧵

172

168

26K
