Mathematics as a field is going to have to reorient itself in light of powerful AI. But a slight pushback to Gowers's comment:
"If LLMs are at the point where they can solve 'gentle problems', ...the lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting."
Mathematics is infinite and thus inexhaustible. With powerful AIs doing the heavy lifting, more of the burden shifts toward taste and asking the right question. It becomes possible to discover something everyone else missed simply by looking in the right place. In mathematical physics, for instance, an Einstein inspired by the equivalence principle might not have to toil for a decade to invent general relativity; he could have equations proposed, their solutions found, and scenarios validated against their Newtonian limits. Rather than raising the bar for problem-solving, AI opens up contributing to mathematics through ideation and generation.
@xuanalogue looked at your CLIPS paper, so yes, an AI that truly infers a student's hidden goals and epistemic state might enable persistence instead of enabling shortcuts. :)
Paper: “Demystifying Oversmoothing in Attention-Based Graph Neural Networks” (NeurIPS 2023, spotlight)
By Xinyi Wu, Amir Ajorlou, Zihui Wu & @jababi at MIT/Caltech.
Key move: they model attention-based GNNs as nonlinear time-varying dynamical systems and use joint spectral radius theory to prove oversmoothing is inevitable for GCNs, GATs, and graph transformers.
Covers ReLU, LeakyReLU, GELU, SiLU.
No architectural trick escapes it. The only way out is rethinking how depth is applied.
📄 https://t.co/2YtgmG5FqJ
Everyone thought attention would solve oversmoothing in GNNs.
It doesn’t. It can’t.
Rigorous proof: expressive power in attention-based GNNs collapses exponentially with depth. GATs, graph transformers - none are immune.
The real insight?
Depth shouldn’t be uniform. A boundary node sitting between two communities needs 2 layers. An interior node in a dense cluster might need 10. Treating them the same is the actual problem.
Structure should dictate depth. Not the other way around.
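A rough way to see the collapse, as a toy sketch rather than the paper's construction: scalar node features, no weight matrices or nonlinearities, and a made-up scoring rule `score(i, j) = x_i * x_j`. The graph, the scoring rule, and the names `NEIGHBORS`, `attention_layer`, and `spread` are all assumptions for illustration. Because softmax attention is a convex combination of neighbor features, the gap between the largest and smallest feature can only shrink as layers stack up.

```python
import math

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
# Every node also attends to itself (self-loop).
NEIGHBORS = {
    0: [0, 1, 2],
    1: [0, 1, 2],
    2: [0, 1, 2, 3],
    3: [2, 3, 4, 5],
    4: [3, 4, 5],
    5: [3, 4, 5],
}


def attention_layer(x):
    """One attention-style averaging step on scalar node features."""
    out = []
    for i in range(len(x)):
        nbrs = NEIGHBORS[i]
        # Hypothetical scoring rule (an assumption): score(i, j) = x_i * x_j.
        scores = [x[i] * x[j] for j in nbrs]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Softmax weights are positive and sum to 1, so the update is a
        # convex combination of the neighbors' features.
        out.append(sum(w * x[j] for w, j in zip(weights, nbrs)) / total)
    return out


def spread(x):
    """Max minus min feature; zero means nodes are indistinguishable."""
    return max(x) - min(x)


def spreads_over_depth(x, depth):
    """Record the feature spread before and after each of `depth` layers."""
    history = [spread(x)]
    for _ in range(depth):
        x = attention_layer(x)
        history.append(spread(x))
    return history


x0 = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]  # two communities, opposite signs
history = spreads_over_depth(x0, 200)
for layer in (0, 10, 50, 200):
    print(layer, history[layer])
```

The two communities start perfectly separated, and attention even down-weights the bridge edge at first, yet the spread still drains away with depth. The toy shows only that attention alone does not stop the collapse; it says nothing about per-node depth.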
This nomenclature always confused me! NP-hard sounds like a subset of NP, but NP means efficiently verifiable, while NP-hard means at least as hard as anything in NP.
Knuth suggested three names, "Herculean", "Formidable", and "Arduous", and sent out a poll to people in the theory community. One write-in suggestion was "Hard-Ass Problems" (Hard As Satisfiability).
Bell Labs won with "NP-hard" and they've been confusing people ever since. The real NP-hard problem was naming NP-hard.
Underlying reason:
Continuity and symmetry induce equivalence classes over inputs.
Transformers collapse nearby sequences into the same representation orbit.
Perplexity is invariant on these orbits.
Correctness is not.
This was never about Perplexity the company.
It is about algebra, group actions, and quotient spaces.
Perplexity is not always right.
It can appear confident and rigorous, and it can score extremely well by its own metric, while still producing an incorrect prediction.
This is not a bug or a training artifact.
The result comes from the paper
“Perplexity Cannot Always Tell Right from Wrong”
This insight leads to a set of fundamental results based on group theory.
I have tried to characterize which forms of node-level memorization are inevitable in GNNs and which require symmetry breaking.
Paper coming after review.
Hot take: a lot of GNN memorization isn’t learned at all.
It’s forced.
Graph symmetry + training dynamics decide what a GNN can and cannot memorize — before data even enters the picture.
Three claims/theorems about deep learning that seem difficult to disprove and even harder to prove:
A) Gradient descent does more than minimize loss. It reshapes geometry by collapsing directions that are irrelevant to the task
(gradient flow induces anisotropic contraction in the pullback metric, with decay along directions orthogonal to the loss gradient).
B) Symmetry does not need to be imposed. When data and objectives are invariant, training dynamics tend to uncover quotient structure implicitly
(optimization trajectories concentrate on equivalence classes induced by approximate group orbits, even without architectural equivariance).
C) Memorization is not storage. It is the emergence of extremely sharp decision geometry confined to negligible-volume regions
(interpolation is achieved via high-curvature decision boundaries localized to sets of vanishing measure in input space).
These are not easy theorems.
But they feel like the right ones to chase.
Genuinely looking for advice, counterexamples, or references from people thinking deeply about this:
@levie_ron @kamalikac @rsalakhu @ok1zjf @neelnanda5 @mmbronstein
A doubly stochastic matrix only redistributes values.
It cannot amplify them or destroy them.
Geometrically, it is a soft mixture of permutations.
It shuffles and mixes, but conserves total signal.
Identity is one extreme case of this.
So mHC does not abandon the identity idea.
It generalizes it.
Identity becomes a stable geometric object instead of a single point.
That is the breakthrough:
deep learning stability enforced by geometry, not tricks.
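A minimal sketch of the "soft mixture of permutations" picture, assuming a hand-picked 3x3 example; this is not the mHC implementation. The helper names and the specific weights are made up for illustration. It builds a doubly stochastic matrix as a convex combination of permutation matrices, with the identity as one ingredient, and checks that applying it conserves the total and never increases the largest magnitude.

```python
def permutation_matrix(perm):
    """Matrix P with P[i][perm[i]] = 1, so (P x)[i] = x[perm[i]]."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]


def birkhoff_mixture(weights, perms):
    """Convex combination of permutation matrices (weights sum to 1)."""
    n = len(perms[0])
    mix = [[0.0] * n for _ in range(n)]
    for w, perm in zip(weights, perms):
        p = permutation_matrix(perm)
        for i in range(n):
            for j in range(n):
                mix[i][j] += w * p[i][j]
    return mix


def matvec(m, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m[i][j] * x[j] for j in range(len(x))) for i in range(len(m))]


# Illustrative choice (an assumption): identity mixed with two permutations.
perms = [(0, 1, 2), (1, 2, 0), (2, 1, 0)]
weights = [0.5, 0.3, 0.2]
D = birkhoff_mixture(weights, perms)

row_sums = [sum(row) for row in D]
col_sums = [sum(D[i][j] for i in range(3)) for j in range(3)]

x = [4.0, -1.0, 2.0]
y = matvec(D, x)
print("row sums:", row_sums)
print("col sums:", col_sums)
print("sum before/after:", sum(x), sum(y))
print("max |value| before/after:", max(map(abs, x)), max(map(abs, y)))
```

Setting `weights = [1.0, 0.0, 0.0]` recovers the plain identity, which is the "one extreme case" above.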
The DeepSeek mHC paper is a real breakthrough, and the reason is geometric, not architectural.
Early neural networks were just repeated matrix multiplications: x <- W x.
Depth was unstable.
ResNets changed one line:
x <- x + F(x)
which linearizes to x <- (I + W)x.
That single identity term is what made deep learning scale.
Hyper-Connections broke this by replacing identity with a learned matrix, turning depth back into unconstrained matrix products.
mHC fixes this in a principled way.
Instead of identity or an arbitrary matrix, mHC uses a doubly stochastic one.
Doubly stochastic matrices form the Birkhoff polytope.
They are convex combinations of permutations.
Geometrically, the residual stream undergoes conservative transport and mixing, not amplification or decay.
Identity is just one extreme point of this space.
Under composition, stability is preserved.
mHC does not abandon identity.
It generalizes it into a stable geometric object.
This is not an engineering trick.
It is linear algebra and geometry doing the real work.
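To make the depth argument concrete, here is a small sketch under stated assumptions: the 2x2 matrices are invented for illustration and reused at every layer, which a real network would not do. It applies an unconstrained matrix and a doubly stochastic one 50 times each, and also checks that the product of two doubly stochastic matrices is again doubly stochastic.

```python
def matvec(m, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m[i][j] * x[j] for j in range(len(x))) for i in range(len(m))]


def matmul(a, b):
    """Plain square matrix product on nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def apply_repeatedly(m, x, depth):
    """Apply the same matrix `depth` times, as a stand-in for stacked layers."""
    for _ in range(depth):
        x = matvec(m, x)
    return x


# Hypothetical 2-stream mixing matrices, chosen only for illustration.
unconstrained = [[1.2, 0.3], [0.1, 1.1]]      # arbitrary "learned" matrix
doubly_stochastic = [[0.7, 0.3], [0.3, 0.7]]  # rows and columns sum to 1
other_ds = [[0.4, 0.6], [0.6, 0.4]]           # a second doubly stochastic one

deep_unconstrained = apply_repeatedly(unconstrained, [1.0, 1.0], 50)
deep_ds = apply_repeatedly(doubly_stochastic, [3.0, -1.0], 50)

# Closure under composition: the product of two doubly stochastic
# matrices is again doubly stochastic.
product = matmul(doubly_stochastic, other_ds)
product_row_sums = [sum(row) for row in product]
product_col_sums = [product[0][j] + product[1][j] for j in range(2)]

print("unconstrained after 50 layers:", deep_unconstrained)
print("doubly stochastic after 50 layers:", deep_ds)
print("product row/col sums:", product_row_sums, product_col_sums)
```

The unconstrained product grows without bound here; a different arbitrary matrix could just as easily shrink the signal toward zero. The doubly stochastic one keeps the total fixed and the magnitudes bounded however many layers are stacked.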
That learned matrix gets applied again and again across layers.
Now depth is no longer identity plus correction.
It is repeated application of an unconstrained matrix.
We are back to the original instability problem.
mHC fixes this by using geometry.
Instead of letting the matrix that replaces the identity be any learned matrix, it restricts that matrix to a special set: the doubly stochastic matrices.
No math needed. Here is the intuition.