Uzay @uzpg_ - Twitter Profile

about 11 hours ago

making my agent observe all the plots I found interesting, suggest plots computed from all the data for the project, and update based on my feedback

0

1

0

56

Uzay

@uzpg_

about 12 hours ago

*spawn agents/pure models, depending on the variant

0

128

Uzay

@uzpg_

about 12 hours ago

RLM gains probably come from the fact that having the model reason in abstractions to spawn agents conditions it to spend many more inference tokens, and use more kinds of thinking, than if you just ran it on the task or asked it to spawn agents.

2

1

0

1

318

Uzay

@uzpg_

about 12 hours ago

a year back @kushal1t @KaivuHariharan @atticuswzf found similar results when we made models generate their own tools in code to play games like chess: they created various useful heuristics they wouldn't employ by default, even if they could

0

1

0

83

Uzay

@uzpg_

about 12 hours ago

because thinking in code at the level of the multi-agent scheme conditions the model a lot, and by default it's under-elicited

1

0

195

Uzay

@uzpg_

1 day ago

the elicitation overhang is bigger than you think.

0

75

uzpg_ retweeted

Carl Guo

@CarlGuo866

3 days ago

Just graduated from MIT with my BS and MEng!

16

380

7

12

22K

Uzay

@uzpg_

2 days ago

in https://t.co/HGdpwsq1mm we found that interventions that reduced data diversity, effectively decreasing this gradient interference, allowed networks to move faster to generalizing solutions, whereas here they study the ability (it seems, haven't read in detail yet) to memorize rarer tasks

0

1

207

Uzay

@uzpg_

2 days ago

cool work!

Ekdeep Singh Lubana @EkdeepL

3 days ago

Very excited to have this paper out! We show by having more parameters, larger models see reduced interference between updates. This allows them to retain memories of rarely observed samples of a task, eventually allowing them to learn even the tail-end of the distribution. (1/3)

4

182

19

88

15K

1

5

0

2

760

Uzay

@uzpg_

2 days ago

models always put irrelevant context from the convo or origin into the documentation or prompts you have them create

0

4

0

1

129

uzpg_ retweeted

Cas (Stephen Casper)

@StephenLCasper

3 days ago

Just finished my PhD at @MITCSAIL. In July, I'll start as an assistant professor at the @Harvard @Kennedy_School. I have lots to learn and lots to do. With others (some TBA 👀) at HKS, I'm looking forward to helping academia offer guidance for governing the next chapters of AI.

https://t.co/UpSU509i5r

71

806

25

48

51K

uzpg_ retweeted

Daniel Litt

@littmath

8 days ago

STEM academia serves two closely intertwined purposes: the production of high quality science and the production of human capital. These two purposes feed into each other.

The obvious direction is that we develop human capital by paying people to produce science. What is perhaps less obvious is that the very fact that human labor is used to produce science has historically been an important input to its quality. The goal of science is not simply to produce papers, but rather to produce good work--that a person is willing to spend months working on a paper is a (weak) witness to the fact that it has some minimum quality. If someone has a record of producing high quality work, that they wrote a paper is a stronger witness, since it was worth the opportunity cost to write it. If many people engage with it substantially, that is even stronger evidence. This is not to say that there isn't lots of low-quality work--there is, in fact a huge amount--but we have strong sorting mechanisms, admittedly using imperfect proxies (all depending on costly human labor!), to find high-quality stuff.

Arguably the paper itself is not the primary product here; in many cases the primary product is actually the expertise developed over the course of producing it, which can then be applied to other questions.

If you believe, as I do, that producing high quality science should be one of our fundamental goals, I think you’re obligated to embrace new tools that help one do so. Refusing to is a declaration that these outputs are not important. But I worry that we are not on track to automate the production of good work; rather, we are on track to automate the production of papers. We need new mechanisms to ensure that we are also producing good work, and to ensure that we are developing the human capital to engage with it.

21

430

39

113

22K

uzpg_ retweeted

levent

@__alpoge__

9 days ago

over the weekend i checked the obvious thing, which is whether mythos is able to solve the erdos unit distance problem, aka erdos problem #90. the answer is: yea

54

2K

143

394

618K

uzpg_ retweeted

jsd

@datagenproc

10 days ago

Some takes on the state of benchmarking. Here are two things you could want from a benchmark:

① Benchmark scores correspond to some measure of real-world usefulness/capability/impact. If a model scores x%, this tells you that it can automate some job, generate a productivity gain of y%, or could pose z level of risk given motive and absent mitigations.

② Benchmark scores allow good comparisons of models. It’s hard to say what a single score corresponds to in the real-world, but differences (or accelerations) in benchmark scores reliably indicate differences in real-world usefulness.

I think ① is very, very hard:
• METR time horizons, suitably caveated, come the closest, but ultimately don’t achieve this IMO (the most interesting thing is the doubling times seeming quite consistent across domains),
• There’s some specific distributions of tasks (e.g. software reimplementation) where we can better interpret saturation (especially given human data), but it’s tough to generalize to e.g. jobs and downstream impact,
• I think it’s reasonable to use low benchmark scores to rule out some threat models. But frequently the benchmarks saturate before we actually think that the risk from the relevant threat model is high,
• Some questions are important enough that even though ① is really tough, it’s still worth trying to get the best number we can (I think AI R&D automation is in that category).

② is a much easier task. I think that metrics like the ECI do a mostly decent job here. Since most benchmarks are pretty correlated, useless but simple-to-evaluate tasks can work ok (though, for the ECI, we likely don’t have enough non-saturated agent benchmarks that we can evaluate new releases on). Still, even for ② you have to be careful when interpreting results, e.g. I think that the acceleration we describe here (https://t.co/hKR4iI4MbX) mostly measures frontier labs starting to RL on benchmarks and benchmark-like environments more.

Here are some types of benchmarks I’m particularly interested in:
• Private, very OOD benchmarks where we can compare the human and AI learning rate. This gets at something like continual learning,
• Open-world evaluations similar to https://t.co/92wT9YGIme,
• Better benchmarks for AI R&D.

Another consideration: it seems that benchmarks have historically accelerated progress on the tasks they measure. So if there’s an application you want to motivate labs to work on, creating a benchmark is useful. Conversely, if you’re worried about AI progress being too fast that’s a reason against benchmarking.

Separately, I’m very excited about better measurement of model behavior beyond capabilities, as outlined by Jacob in this post: https://t.co/6lrupBx9dc

1

32

6

13

3K

uzpg_ retweeted

Xiao Ma

@MaXiao54704

14 days ago

Everyone’s talking about OAI’s new progress on the Erdős unit distance problem, I used standard GPT-5.5 to reproduce the proof ~ 👇 https://t.co/8uKw1OXN0q Apparently Erdős offered a $1,000 bounty for it… which means my 5.5 Pro subscription might actually pay for itself ~

11

252

23

88

142K

Uzay

@uzpg_

14 days ago

cool work!

Core Francisco Park

@corefpark

15 days ago

🚨 New Paper! (Part 1: Pretraining) Many recent works show beautiful representational geometry in neural networks. But what controls the geometry of world representations during pretraining? We decouple the world from data to study this in a controlled setup. 1/n

12

573

81

438

47K

0

2

0

2

353

uzpg_ retweeted

Timothy Gowers @wtgowers

15 days ago

If you are a mathematician, then you may want to make sure you are sitting down before reading further.

168

9K

886

4K

3M

uzpg_ retweeted

Richard Ngo

@RichardMCNgo

over 1 year ago

Many in AI safety have narrowed in on automated AI R&D as a key risk factor in AI takeover. But I'm concerned that the actions they're taking in response (e.g. publishing evals, raising awareness in labs) are very similar to the actions you'd take to accelerate automated AI R&D.

13

361

24

115

46K

uzpg_ retweeted

METR @METR_Evals

16 days ago

Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control. The result: our first Frontier Risk Report.

https://t.co/sUpiHgCrTM

30

897

195

544

337K

Uzay

@uzpg_

16 days ago

cool work!

Jiaxin Wen

@jiaxinwen22

17 days ago

New post: "Generalization Dynamics of LM Pre-training" Most people (including me) assume that LMs smoothly mature from pattern-matching to generalizing. This mental model is wrong. The true dynamics are stranger, and far more fascinating! We call it Mode-Hopping.

11

537

83

529

95K

0

3

457
