Today, we’re sharing that a general-purpose internal @openai model achieved a breakthrough on one of the best-known combinatorial geometry problems. Less than 1 year ago frontier AI models were at IMO gold-level performance. I expect this pace of progress to continue.
Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946.
For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids.
An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better.
This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.
GPT-5.2 solves our COLT 2022 open problem: “Running Time Complexity of Accelerated L1-Regularized PageRank” using a standard accelerated gradient algorithm and a complementarity margin assumption.
Link to the open problem: https://t.co/A3ZbJshudE
All proofs were generated by GPT-5.2 Pro. The key bounds on the algorithm’s total work (in the COLT’22 open-problem setting) have been auto-formalized using a combination of GPT-5.2 Pro, @HarmonicMath's Aristotle, and Gemini 3 Pro (High) on Antigravity.
Link to the proof: https://t.co/hgJ0iBcWJe
Link to the Lean code: https://t.co/DeMFDlwSC9
Link to the informalization of the Lean code: https://t.co/V5BwYoIycN
Link to my GPT-5.2 prompts: https://t.co/xwh5c6S81B
In addition to the formalization of the main result, I checked the proof myself twice. I hope I didn’t miss anything, but if I did, please let me know and I will try to fix it.
Story behind the paper and relevant work
In 2016, I worked on the convergence rate of the Iterative Soft-Thresholding Algorithm (ISTA) for l1-regularized PageRank.
Link to the corresponding paper: https://t.co/pDMN9QKkGh
Surprisingly, the running time of the algorithm depends only on the number of non-zero nodes at optimality. It was only natural to ask the same question for accelerated methods, such as FISTA. However, we quickly realized that FISTA activates more nodes than the number of non-zeros at optimality, even though it eventually converges to the same active set. In practice, we would still observe that FISTA is fast.
Link to empirical work: https://t.co/VQFJugQk0m
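To make the ISTA vs. FISTA comparison above concrete, here is a minimal pure-Python sketch. It assumes the variational form of l1-regularized PageRank, f(x) = ½ xᵀQx − α xᵀD^{-1/2}s + ρα‖D^{1/2}x‖₁ with Q = αI + (1−α)/2 · (I − D^{-1/2}AD^{-1/2}), a step size of 1, and the constant momentum (1−√α)/(1+√α) for the accelerated variant. The objective details, the function name `l1_pagerank`, the parameter values, and the path graph are illustrative assumptions, not taken from the linked papers. The loop is dense over all nodes, so it only records which coordinates ever become non-zero; it does not reproduce the local running time.

```python
import math

def l1_pagerank(adj, seed, alpha=0.15, rho=1e-2, iters=300, accelerated=False):
    """Proximal gradient (ISTA) or a constant-momentum accelerated variant.

    adj is an adjacency list of an undirected graph without isolated nodes.
    Returns the final iterate and the set of nodes that were ever non-zero.
    """
    n = len(adj)
    sq = [math.sqrt(len(nbrs)) for nbrs in adj]  # square roots of degrees
    # Assumed momentum for a 1-smooth, alpha-strongly-convex smooth part.
    beta = (1 - math.sqrt(alpha)) / (1 + math.sqrt(alpha)) if accelerated else 0.0
    x = [0.0] * n
    y = [0.0] * n
    touched = set()
    for _ in range(iters):
        x_new = []
        for i in range(n):
            # (Q y)_i for Q = alpha*I + (1-alpha)/2 * (I - D^{-1/2} A D^{-1/2})
            nbr_sum = sum(y[j] / (sq[i] * sq[j]) for j in adj[i])
            qy = alpha * y[i] + 0.5 * (1 - alpha) * (y[i] - nbr_sum)
            grad = qy - (alpha / sq[i] if i == seed else 0.0)
            z = y[i] - grad                      # gradient step with step size 1
            t = rho * alpha * sq[i]              # soft-thresholding level
            x_new.append(math.copysign(max(abs(z) - t, 0.0), z))
        # Extrapolation step; with beta = 0 this reduces to plain ISTA.
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]
        x = x_new
        touched.update(i for i, v in enumerate(x) if v != 0.0)
    return x, touched

# Hypothetical test graph: a path on 12 nodes, seeded at one end.
n = 12
adj = [[1]] + [[i - 1, i + 1] for i in range(1, n - 1)] + [[n - 2]]

x_ista, touched_ista = l1_pagerank(adj, seed=0)
x_fista, touched_fista = l1_pagerank(adj, seed=0, accelerated=True)
support = {i for i, v in enumerate(x_ista) if v != 0.0}
print("final support size:", len(support))
print("nodes ever touched by ISTA:", len(touched_ista))
print("nodes ever touched by accelerated variant:", len(touched_fista))
```

Comparing the two `touched` sets against the final support is one way to see whether the accelerated iterates activate extra nodes on a given graph before settling on the same active set.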
I tried for about three months to bound the total work of FISTA and other accelerated algorithms, and from time to time I would come back to the problem while I was a postdoctoral fellow. Eventually, I gave up. I gave it another try around 2021, and I failed again. I asked my excellent former student, Shenghao Yang, and he also failed, unfortunately. I asked a couple of prominent researchers whether they thought the problem was solvable, and they quickly said that it seemed hard. We ended up publishing it as an open problem at COLT 2022.
In 2023, David Martínez-Rubio et al. provided the first successful solution. Their solution is “orthogonal” to what was proved by GPT-5.2.
Link to their paper: https://t.co/YPUrfGhG2T I loved their work btw, I also met David in person at ICML 2024, one of the few ML conferences I ever attended.
Their proposed accelerated algorithm is not necessarily faster than ISTA; however, it does offer a new trade-off between the teleportation parameter of PageRank and the total work per iteration. More importantly, the proposed method isn’t necessarily practical, since it involves solving an expensive subproblem. To be fair, in the COLT 2022 problem, we didn’t impose the additional hard constraint of using standard accelerated methods. The problem was posed as a theoretical problem. The solution proved by GPT-5.2 establishes acceleration for the standard FISTA algorithm, which performs only one gradient computation per iteration. It also offers a clean parameterization of the total work with respect to a complementarity margin, which, for certain graph structures, shows a clear speed-up compared to ISTA.
In 2024, Zhou et al. (https://t.co/Agq5ANfhuS) gave it another go. However, in my view, their work has important drawbacks. In particular, their guarantees for accelerated localized methods (e.g., localized Chebyshev / Heavy-Ball) assume a condition on the geometric mean of certain active-ratio factors (described as Θ(\sqrt{α})) in order to obtain an accelerated bound.
Two distinctions matter for our setting:
First, their accelerated runtime bounds are parameterized by evolving-set quantities and a residual-ratio assumption, which can be evaluated during a run but is not typically interpretable or verifiable a priori from graph structure alone. The solution by GPT-5.2 instead provides an explicit transient-phase bound in terms of a standard optimization-structure condition, and converts this directly into a total work bound.
Second, they explicitly note that FISTA-style acceleration violates the monotonicity property needed to bound the per-iteration accessed volume, and emphasize that guaranteeing intermediate sparsity in accelerated frameworks is challenging. The margin-based analysis by GPT-5.2 directly targets this gap: even without any monotonicity of intermediate supports, GPT-5.2 bounded how much spurious activation can occur before the iterates enter a neighborhood of the unique minimizer, thereby yielding a concrete locality certificate for the accelerated proximal-gradient trajectory.
Since 2024, every time OpenAI or Google released a new major model, I would give it a go. This time, with GPT-5.2, it seems to have worked.
JUST IN: Claude, the California Academy of Sciences’ rare albino alligator and one of San Francisco’s most recognizable residents, has died at age 30. https://t.co/Z3iD49p7vZ
I never got why there's a big group that seems to split on value-based vs policy-based. Somehow the policy-based folks think they don't need to learn any formalism/math/theory and can just guess a zeroth-order gradient estimator? But every policy opt that works uses some variance reduction technique that comes from thinking about MDPs.
>read proof by X
>makes no sense
>I’ll figure it out myself
>work hard, finally get it
>write it down
>it’s exactly the same as the proof by X
Can Transformers Do Everything, and Undo It Too?
Check out my blog on whether language models are surjective, injective, or invertible!
https://t.co/9v0gd2962J
Announcing the first workshop on Foundations of Post-Training (FoPT) at COLT 2025!
📝 Soliciting abstracts/posters exploring theoretical & practical aspects of post-training and RL with language models!
🗓️ Deadline: May 19, 2025
PhD students: Remember to apply for the Google PhD fellowship. It will make your PhD super smooth.
Application opens on 10th April 2025
Deadline: 15th May 2025
Thinking for longer (e.g. o1) is only one of many axes of test-time compute. In a new @Google_AI paper, we instead focus on scaling the search axis. By just randomly sampling 200x & self-verifying, Gemini 1.5 ➡️ o1 performance. The secret: self-verification is easier at scale!
Hi! Fitting all answers into a single context window doesn't seem to work great for problem solving... I usually just have models do a pass on each attempt individually to weed out the clearly dumb ones, and then run pairwise (k-wise for k>2 doesnt help much imo) comparisons to tie-break between the plausible candidates; this is what we did in the paper. A combination of that + applying search to the verification problem suffices, at least to the extent that the main bottleneck becomes generation not verification
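The two-stage selection described above (filter each attempt on its own, then tie-break the survivors with pairwise comparisons) can be sketched roughly as follows. The names `select_answer`, `passes_check`, and `prefer` are hypothetical; in practice the two callables would be model calls, and this is not the paper's actual implementation.

```python
from itertools import combinations

def select_answer(candidates, passes_check, prefer):
    """Pick one candidate via per-attempt filtering plus pairwise tie-breaks.

    passes_check(c) -> bool: individual pass that weeds out clearly bad attempts.
    prefer(a, b) -> bool: True if a is judged better than b in a pairwise comparison.
    """
    plausible = [c for c in candidates if passes_check(c)]
    if not plausible:
        return None
    wins = {i: 0 for i in range(len(plausible))}
    # Round-robin over all pairs; the candidate with the most wins is selected.
    for i, j in combinations(range(len(plausible)), 2):
        winner = i if prefer(plausible[i], plausible[j]) else j
        wins[winner] += 1
    best = max(wins, key=wins.get)
    return plausible[best]

# Toy stand-ins for the model calls: keep numbers below 10, prefer larger ones.
picked = select_answer([3, 8, 5, 12], lambda c: c < 10, lambda a, b: a > b)
print(picked)
```

The round-robin costs a number of comparisons quadratic in the number of surviving candidates, which is one reason to filter aggressively in the first pass.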
For info-retrieval problems, putting them all into same window usually works well enough. When it doesn't (like I'm trying to merge multiple arxiv papers), i either "merge" them into the aggregation one-by-one or by having a model group them semantically, merge within groups, and then merge between groups
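A rough sketch of the two aggregation strategies mentioned above, one-by-one merging and group-then-merge. Here `merge` and `group_key` are hypothetical placeholders for model calls (a pairwise merge and a semantic grouping step), and both functions assume a non-empty list of documents.

```python
from collections import defaultdict
from functools import reduce

def merge_sequential(docs, merge):
    """Fold the documents into a running aggregate one at a time."""
    return reduce(merge, docs)

def merge_grouped(docs, group_key, merge):
    """Group documents, merge within each group, then merge across groups."""
    groups = defaultdict(list)
    for doc in docs:
        groups[group_key(doc)].append(doc)
    partials = [reduce(merge, group) for group in groups.values()]
    return reduce(merge, partials)

# Toy stand-ins: "merging" joins strings, "grouping" uses the first character.
docs = ["a1", "b1", "a2"]
join = lambda x, y: x + "+" + y
print(merge_sequential(docs, join))
print(merge_grouped(docs, lambda d: d[0], join))
```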
@ddkang @Google_AI Sorry it's been stuck in the Google open-sourcing process for a while... the new arXiv version should include all prompts + parameters necessary for duplication, but also if you reach me over email I can give you code directly so you can get set up.
Thanks for the interest :)
Hm I guess you're asking what percentage of that pass@k - pass@1 we can actually capture? I think it depends on the problem.
On multiple-choice exams pass@k might go to 100%, but the model might never actually reach the answer correctly, so pass@k far exceeds the perf of search with k samples.
On problems where you're unlikely to run into the correct answer (or if you let pass@k only count correct proofs + answers), then it depends on how easy verification is. On AIME, there's basically no gap at scale; on livebench reasoning puzzles you can get like 80%; for my personal theory research usage probably close to 80% too
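For readers who want to compute the quantities in this reply, here is a small sketch. `pass_at_k` uses the standard unbiased estimator 1 − C(n−c, k)/C(n, k) from n samples with c correct; `gap_captured` is a hypothetical helper for the "percentage of pass@k − pass@1 captured" notion, and the numbers in the example are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def gap_captured(pass1, passk, search_acc):
    """Fraction of the pass@k - pass@1 gap recovered by search; None if no gap."""
    gap = passk - pass1
    if gap <= 0:
        return None
    return (search_acc - pass1) / gap

# Illustrative numbers only: 10 samples, 3 correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
print(gap_captured(0.3, 0.8, 0.7))
```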
Thanks! I guess it's hard to draw a clean boundary, bc RL-trained models do learn to perform search serially in their thinking traces. But I'd say most of their gains are actually attributable to backtracking, going in more detail, self-prompting---which search scaling doesnt overlap with and should stack on top of.
On orthogonality, even reasoning models benefit significantly from parallel search; you can get a good sense of this by just comparing their pass@1 and pass@k (can they get something right in k tries). If anything, i feel like the reasoning models ive used are *more* ergodic on hard proofs
@ElsheikhTech That's what you should be doing; we focused on search in this paper bc we wanted to understand it better, but in our workflows we're applying these to reasoning models