This is cool but disingenuous framing imo. All of the records were marginal, building on other records, so “outperformed all 1,016 other researchers” is a stretch lol
I’m curious what this system’s performance would be if it couldn’t build on human records
OpenAI ran a hiring challenge, but the top candidate was one they couldn’t hire: our autonomous research agent, Aiden.
In Parameter Golf, Aiden ran for 22 days, and outperformed all 1,016 other researchers: 🧵 (1/8)
@YifeiZuoX @Haoxiang__Wang @kellerjordan0 I mean compute-matched and param-matched during training,
e.g. both models at 0.6B params and trained for 1e23 FLOPs
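To make "compute-matched and param-matched" concrete, here is a rough sketch. It assumes the common rule of thumb that training compute is about 6 FLOPs per parameter per token for a dense transformer; that constant, the function name `tokens_for_budget`, and the 8.0 multiplier for a heavier architecture are all illustrative assumptions, not numbers from the paper.

```python
def tokens_for_budget(total_flops: float, n_params: float,
                      flops_per_param_token: float = 6.0) -> float:
    """Training tokens implied by a FLOP budget.

    Assumes C ~= k * N * D, with k = 6 as the usual dense-transformer
    rule of thumb. Other architectures may need a different k.
    """
    return total_flops / (flops_per_param_token * n_params)


budget = 1e23      # FLOPs, the example figure from the post
n_params = 0.6e9   # both models at 0.6B params

# Same params, same budget, same cost per token: same number of tokens.
baseline_tokens = tokens_for_budget(budget, n_params)

# Hypothetical model that spends more compute per token (k = 8 is made up):
# matching compute at fixed params means it sees fewer tokens.
heavier_tokens = tokens_for_budget(budget, n_params, flops_per_param_token=8.0)

print(f"baseline: {baseline_tokens:.2e} tokens")
print(f"heavier:  {heavier_tokens:.2e} tokens")
```

So if the two architectures differ in FLOPs per token, matching both params and compute forces the token counts apart, which is why it matters which comparisons get reported.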
@YifeiZuoX @Haoxiang__Wang @kellerjordan0 Thanks for the response, and great paper! Isn’t parallax better compute-matched at 0.6B? Or am I missing something? And any reason the param-matched transformer and compute-matched parallax aren’t included for 1.7B?
@whatthelukh Great post! wdyt about CTPO as a lower-variance-but-still-unbiased estimate of IS ratio? I'm curious how it would do on your "Simple Horizon Simulation" at small batch sizes
🚨 Before concluding:
As noted by @Sam_Acqua and many others, we all ought to be very skeptical of Gen PPL as a metric, especially in isolation. ❌ It is actually a bit crazy that we have been using it for so long. Hence the additional metrics, and the presence of several qualitative samples in the appendix. Please have a look yourselves to get a better understanding of the sample quality! 🔍 There are comparisons across SD/non-SD, number of NFEs, and others.
The grad moment metric addresses the "no measure of inter-sample diversity" point, but it still has some of the same flaws:
1. still sensitive to sample entropy / slight differences in sampling (but it is *less* sensitive)
2. still treating gpt2 as oracle
and a new issue is that it is much less interpretable. I still think val ppl bounds like langflow are the way forward even though they are relatively expensive. Compared to training the model, the cost is negligible.
Flow models are a promising alternative to autoregression. But the current standard of evaluating flow models is broken.
The reported 3x improvement in 1024-step PPL since 2023 is closer to 1.1x if you control for sample entropy.
(1/12)
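For anyone unsure what "sample entropy" means here, a minimal sketch, assuming it refers to the unigram entropy of the token ids within each generated sample, averaged over samples. That definition and the function names are my assumptions, not necessarily the exact metric used in the thread.

```python
import math
from collections import Counter


def unigram_entropy(token_ids):
    """Shannon entropy (in nats) of the token frequencies in one sample."""
    counts = Counter(token_ids)
    total = len(token_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def mean_sample_entropy(samples):
    """Average per-sample unigram entropy over a batch of samples."""
    return sum(unigram_entropy(s) for s in samples) / len(samples)


# A repetitive sample has low entropy, a varied one has high entropy.
repetitive = [7, 7, 7, 7, 7, 7, 7, 7]
varied = [1, 2, 3, 4, 5, 6, 7, 8]
print(unigram_entropy(repetitive), unigram_entropy(varied))
```

Repetitive text tends to score a low Gen PPL under the scoring model, so comparing PPL between samplers with different entropy can reward degenerate samples.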
@emiel_hoogeboom Very cool! I feel like it's best to also standardize # NFEs to make comparison easier, given that comparing 1024 NFEs vs 4 NFEs is a bit unfair.
Fully agreed re: entropy-matched gen ppl is still very flawed. gpt2 grad moment is an interesting idea
@Clashluke I mean the best would be domain expert + something like HEBO right? But obv it's a pain to define the joint prior distribution.
Second best being autoresearch + HEBO like https://t.co/1RZaVWCtir, maybe best if SOTA = on pareto of human researcher v. loss
@giladturok DUEL is awesome I reference it all the time. I didn’t mention it because I only referenced diffusions that were cited in flow model papers. But I’ll add a note about it once I have time :)
@jwthickstun @Chramblin @nmboffi @ReeceShuttle @akshayvegesna Interesting idea! I haven't seen Mauve before -- I'd have to look closer before giving any useful response.
Thanks for reading :) I'll reply once I have time to check it out