RLM gains probably come from the fact that having the model reason in abstractions to spawn agents conditions it to spend many more inference tokens, and use more kinds of thinking, than if u just ran it on the task or asked it to spawn agents.
a year back @kushal1t @KaivuHariharan @atticuswzf found something similar when we made models generate their own tools in code to play games like chess: they created various useful heuristics they wouldn't employ by default, even if they could
in https://t.co/HGdpwsq1mm we found that interventions that reduced data diversity, effectively decreasing this gradient interference, allowed networks to move faster to generalizing solutions, whereas here they study the ability (it seems, haven't read in detail yet) to memorize rarer tasks
Very excited to have this paper out! We show that, by having more parameters, larger models see reduced interference between updates. This allows them to retain memories of rarely observed samples of a task, eventually allowing them to learn even the tail-end of the distribution. (1/3)
Just finished my PhD at @MITCSAIL. In July, I'll start as an assistant professor at the @Harvard @Kennedy_School. I have lots to learn and lots to do.
With others (some TBA 👀) at HKS, I'm looking forward to helping academia offer guidance for governing the next chapters of AI.
STEM academia serves two closely intertwined purposes: the production of high quality science and the production of human capital. These two purposes feed into each other. The obvious direction is that we develop human capital by paying people to produce science.
What is perhaps less obvious is that the very fact that human labor is used to produce science has historically been an important input to its quality. The goal of science is not simply to produce papers, but rather to produce good work--that a person is willing to spend months working on a paper is a (weak) witness to the fact that it has some minimum quality. If someone has a record of producing high-quality work, that they wrote a paper is a stronger witness, since it was worth the opportunity cost to write it. If many people engage with it substantially, that is even stronger evidence. This is not to say that there isn't lots of low-quality work--there is, in fact, a huge amount--but we have strong sorting mechanisms, admittedly using imperfect proxies (all depending on costly human labor!), to find high-quality stuff. Arguably the paper itself is not the primary product here; in many cases the primary product is actually the expertise developed over the course of producing it, which can then be applied to other questions.
If you believe, as I do, that producing high-quality science should be one of our fundamental goals, I think you’re obligated to embrace new tools that help one do so. Refusing to do so is a declaration that these outputs are not important. But I worry that we are not on track to automate the production of good work; rather, we are on track to automate the production of papers. We need new mechanisms to ensure that we are also producing good work, and to ensure that we are developing the human capital to engage with it.
over the weekend i checked the obvious thing, which is whether mythos is able to solve the erdos unit distance problem, aka erdos problem #90. the answer is: yea
Some takes on the state of benchmarking.
Here are two things you could want from a benchmark:
① Benchmark scores correspond to some measure of real-world usefulness/capability/impact. If a model scores x%, this tells you that it can automate some job, generate a productivity gain of y%, or could pose z level of risk given motive and absent mitigations.
② Benchmark scores allow good comparisons of models. It’s hard to say what a single score corresponds to in the real-world, but differences (or accelerations) in benchmark scores reliably indicate differences in real-world usefulness.
I think ① is very, very hard:
• METR time horizons, suitably caveated, come the closest, but ultimately don’t achieve this IMO (the most interesting thing is the doubling times seeming quite consistent across domains),
• There are some specific distributions of tasks (e.g. software reimplementation) where we can better interpret saturation (especially given human data), but it’s tough to generalize to e.g. jobs and downstream impact,
• I think it’s reasonable to use low benchmark scores to rule out some threat models. But frequently the benchmarks saturate before we actually think that the risk from the relevant threat model is high,
• Some questions are important enough that even though ① is really tough, it’s still worth trying to get the best number we can (I think AI R&D automation is in that category).
② is a much easier task. I think that metrics like the ECI do a mostly decent job here. Since most benchmarks are pretty correlated, useless but simple-to-evaluate tasks can work ok (though, for the ECI, we likely don’t have enough non-saturated agent benchmarks that we can evaluate new releases on).
Still, even for ② you have to be careful when interpreting results, e.g. I think that the acceleration we describe here (https://t.co/hKR4iI4MbX) mostly measures frontier labs starting to RL on benchmarks and benchmark-like environments more.
Here are some types of benchmarks I’m particularly interested in:
• Private, very OOD benchmarks where we can compare the human and AI learning rate. This gets at something like continual learning,
• Open-world evaluations similar to https://t.co/92wT9YGIme,
• Better benchmarks for AI R&D.
Another consideration: it seems that benchmarks have historically accelerated progress on the tasks they measure. So if there’s an application you want to motivate labs to work on, creating a benchmark is useful. Conversely, if you’re worried about AI progress being too fast, that’s a reason against benchmarking.
Separately, I’m very excited about better measurement of model behavior beyond capabilities, as outlined by Jacob in this post: https://t.co/6lrupBx9dc
Everyone’s talking about OAI’s new progress on the Erdős unit distance problem, so I used standard GPT-5.5 to reproduce the proof ~ 👇
https://t.co/8uKw1OXN0q
Apparently Erdős offered a $1,000 bounty for it… which means my 5.5 Pro subscription might actually pay for itself ~
🚨 New Paper! (Part 1: Pretraining)
Many recent works show beautiful representational geometry in neural networks.
But what controls the geometry of world representations during pretraining?
We decouple the world from data to study this in a controlled setup.
1/n
Many in AI safety have narrowed in on automated AI R&D as a key risk factor in AI takeover. But I'm concerned that the actions they're taking in response (e.g. publishing evals, raising awareness in labs) are very similar to the actions you'd take to accelerate automated AI R&D.
Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control.
The result: our first Frontier Risk Report.
New post: "Generalization Dynamics of LM Pre-training"
Most people (including me) assume that LMs smoothly mature from pattern-matching to generalizing.
This mental model is wrong. The true dynamics are stranger, and far more fascinating!
We call it Mode-Hopping.