Sam Acquaviva

Verified account

@Sam_Acqua

now @ after thought, ex-cocosci @ mit

Joined February 2014

384 Following

370 Followers

80 Posts

about 22 hours ago

This is cool but disingenuous framing imo. All of the records were marginal, building on other records, so “out-outperformed all 1,016 other researchers” is a stretch lol. I’m curious what this system’s performance would be if it couldn’t build on human records.

1 day ago

OpenAI ran a hiring challenge, but the top candidate was one they couldn’t hire: our autonomous research agent, Aiden. In Parameter Golf, Aiden ran for 22 days, and out-outperformed all 1,016 other researchers: 🧵 (1/8)

14

476

48

275

81K

2

13

0

1

2K

1 day ago

@YifeiZuoX awesome -- thank you for running this

0

2

0

0

78

2 days ago

@YifeiZuoX @Haoxiang__Wang @kellerjordan0 I mean compute-matched and param-matched during training. e.g. both models at .6B and trained for 1e23 FLOPs (for example)

1

2

0

0

124

2 days ago

@YifeiZuoX @Haoxiang__Wang @kellerjordan0 Thanks for the response — and great paper! Isn’t parallax better compute-matched at .6b? Or am I missing something? And any reason the param-matched transformer and compute-matched parallax aren’t included for 1.7b?

1

2

0

0

128

3 days ago

@whatthelukh Great post! wdyt about CTPO as a lower-variance-but-still-unbiased estimate of IS ratio? I'm curious how it would do on your "Simple Horizon Simulation" at small batch sizes

0

0

0

0

625

7 days ago

@redtachyon + w/ discount factor = 1 & 0/1 rewards, reinforce = rejection sampling

0

3

0

0

230
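A minimal sketch of the REINFORCE / rejection sampling point above, under stated assumptions: no baseline, a single-step softmax policy over four discrete actions (with discount factor 1 and a terminal reward, a whole sequence can be treated as one action), and a hypothetical 0/1 reward that accepts actions 0 and 2. All function names here are illustrative. With those assumptions, the REINFORCE gradient estimate equals the log-likelihood gradient on the accepted samples, scaled by the acceptance rate.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_prob(probs, action):
    # Gradient of log softmax(logits)[action] w.r.t. logits: onehot(action) - probs
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def reinforce_grad(probs, actions, rewards):
    # (1/N) * sum_i r_i * grad log pi(a_i), no baseline
    n = len(actions)
    grad = [0.0] * len(probs)
    for a, r in zip(actions, rewards):
        g = grad_log_prob(probs, a)
        for k in range(len(grad)):
            grad[k] += r * g[k] / n
    return grad


def rejection_sampling_grad(probs, actions, rewards):
    # Mean log-likelihood gradient over the accepted (reward == 1) samples only
    accepted = [a for a, r in zip(actions, rewards) if r == 1]
    grad = [0.0] * len(probs)
    for a in accepted:
        g = grad_log_prob(probs, a)
        for k in range(len(grad)):
            grad[k] += g[k] / len(accepted)
    return grad, len(accepted)


random.seed(0)
logits = [0.2, -0.5, 1.0, 0.0]
probs = softmax(logits)
actions = random.choices(range(4), weights=probs, k=1000)
rewards = [1 if a in (0, 2) else 0 for a in actions]  # hypothetical 0/1 reward

g_reinforce = reinforce_grad(probs, actions, rewards)
g_rs, n_accepted = rejection_sampling_grad(probs, actions, rewards)
accept_rate = n_accepted / len(actions)
```

On the same batch, `g_reinforce` matches `accept_rate * g_rs` up to floating-point error, so the two updates point in the same direction and differ only in step size.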

12 days ago

@_TarunKathuria ah, makes sense -- thanks

0

1

0

0

22

12 days ago

@_TarunKathuria re: "I don’t think Aurora is a great solution for a variety of reasons" can you say more? I'm curious

1

0

0

0

26

16 days ago

Great work from Oscar on scaling up flow models!

Oscar Davis @osclsd

16 days ago

🚨 Before concluding: As noted by @Sam_Acqua and many others, we all ought to be very skeptical of Gen PPL as a metric, especially in isolation. ❌ It is actually a bit crazy that we have been using it for so long. Hence, the additional metrics, and the presence of several qualitative samples in the appendix. Please have a look yourselves to get a better understanding of the sample quality! 🔍 There's comparisons across SD/non-SD, number of NFEs, and others.

1

11

1

0

1K

1

4

0

0

701

17 days ago

The grad moment metric addresses the "no measure of inter-sample diversity" point, but it still has some of the same flaws:

1. still sensitive to sample entropy / slight differences in sampling (but it is *less* sensitive)

2. still treating gpt2 as oracle

A new issue is that it is much less interpretable. I still think val ppl bounds like langflow are the way forward even though they are relatively expensive. Compared to training the model, the cost is negligible.

0

1

0

0

17

22 days ago

Flow models are a promising alternative to autoregression. But the current standard of evaluating flow models is broken. The reported 3x improvement in 1024-step PPL since 2023 is closer to 1.1x if you control for sample entropy. (1/12)

7

164

30

138

50K

17 days ago

@emiel_hoogeboom Very cool! I feel like it's best to also standardize # NFEs to make comparison easier, given that comparing 1024 NFEs vs 4 NFEs is a bit unfair. Fully agreed that entropy-matched gen ppl is still very flawed. gpt2 grad moment is an interesting idea

1

2

0

0

37

20 days ago

@pratyushbuddiga Thanks Pratyush!

0

2

0

0

80

20 days ago

@yule_gan wdyt about the relation to https://t.co/yBIdk3gPCt

1

1

0

1

241

20 days ago

@sedielem Thanks Sander!

0

2

0

0

285

21 days ago

@Clashluke I mean the best would be domain expert + something like HEBO right? But obv it's a pain to define the joint prior distribution. Second best being autoresearch + HEBO like https://t.co/1RZaVWCtir, maybe best if SOTA = on pareto of human researcher v. loss

0

3

1

1

286

21 days ago

@giladturok DUEL is awesome I reference it all the time. I didn’t mention it because I only referenced diffusions that were cited in flow model papers. But I’ll add a note about it once I have time :)

1

2

0

0

110

22 days ago

@jwthickstun @Chramblin @nmboffi @ReeceShuttle @akshayvegesna Interesting idea! I haven't seen Mauve before -- I'd have to look closer before giving any useful response. Thanks for reading :) I'll reply once I have time to check it out

0

0

0

0

28

22 days ago

@LucaAmb @PatrickPyn35903 Yes let's do it! I'll dm

0

2

0

0

43
