@martinmbauer The ‘weirdness’ is the gap between logic and intuition.
If you claim it's not weird, you need to give an intuitive explanation, not a logical one.
You did a trivial manipulation of symbols using a system humanity spent 1000s of years developing precisely to overcome that gap.
@dylan522p @alexandr_wang Scale AI is a glorified Mechanical Turk. It's strange that all the major AI labs are handing over vast amounts of proprietary data to them (the RLHF model interaction data). What does Scale do with this data?
@Evan_Mann @balajis Verifying an AI output in this context doesn't mean cryptographically verifying it; it means asking "is the answer to the math problem I gave it actually correct, or is it hallucinating?"
@ck_oro What about the data that isn't easily acquired passively by plugging yourself in? For example, how would you collect the 'chain of thought' of a professional mathematician (not just solving a math competition problem, but conducting research, warts and all)?
@xlr8harder @natolambert It's all infrastructure and data. The engineers are hopping from one company to the next, so there's no secret architecture or training method (and if there were, it wouldn't stay secret for long).
@ElliotGlazer There’s little point evaluating a model trained on informal math (o3) on this… of course it wouldn’t be able to answer! I wonder how AlphaProof would do, or a fine-tuned o3 like you say.
@thomasahle @wtgowers The NuminaMath dataset scraped all competition problems (regional olympiads, national, international, and shortlisted) plus Putnam competition problems, and it contains about 860k problems, so I’m guessing it’s similar to this dataset. There are about 100k actually hard questions in total; the rest are easy.
@LaurentSartran @hbouammar @GoogleDeepMind @JeffDean Would you say data is a bottleneck? I assume you used something like NuminaMath, with about 100k 'difficult' questions. The model is able to construct many variations in the vicinity of existing problems, but how much would an extra 100k unique, interesting problems (olympiad-ish level) help?
@LaurentSartran @hbouammar @GoogleDeepMind @JeffDean How come the fine-tuned Gemini model isn't able to automatically formalize the problem? I realize you need to be certain that the translation is correct, but didn't you automatically formalize 1 million questions during training? Great work btw (I know I'm late!)
@KaixuanHuang1 It would be good if you released the perturbed dataset you created. With the one example you gave, I found that o1, GPT-4o, and DeepSeek R1 all gave correct answers, and I repeated this many times.
@Enzorouxx Good summary! RL is limitless when we have a verifiable generator AND a verifiable discriminator (e.g. Go). In domains like formal math, we only have a verifiable discriminator (Lean), with no way to systematically generate valid problem-solution pairs beyond a brute-force search.
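
To make "verifiable discriminator" concrete, here is a minimal Lean 4 sketch. The theorem name `add_comm_sketch` is made up for illustration; `Nat.add_comm` is the core library lemma. Lean accepts this statement-proof pair only because the proof term type-checks against the statement, and a wrong proof would be rejected. Nothing here says how to generate interesting statements in the first place, which is the missing generator side.

```lean
-- Illustrative only: the theorem name is hypothetical, the lemma is from Lean core.
-- The kernel checks that the proof term really proves the stated proposition.
theorem add_comm_sketch (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```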