Gwen Umbach

@gju__

building chips @etched. previously investor @ altimeter capital

San Francisco, CA

Joined December 2018

1.5K Following

1.4K Followers

170 Posts

gju__ retweeted

Noam Brown

@polynoamial

4 months ago

Perhaps a 🌶️ take but I think the criticisms of @GoogleDeepMind's release are missing the point, and the real problem is that AI labs and safety orgs need to adapt to a world where intelligence is a function of inference compute.

When Google says that Deep Think poses no new risks beyond Gemini 3 Pro, they probably mean that Deep Think is a scaffold of Gemini 3 Pro that anyone externally could have constructed on their own anyway. In other words, the capabilities of Deep Think have always been available to anyone willing to pay for Deep Think amounts of inference, simply by scaffolding a bunch of Gemini 3 Pro queries together. Deep Think just makes that more convenient for the casual user.

The corollary of this is that capabilities far beyond Gemini 3 Deep Think are already available to anyone willing to scaffold a system together that uses even more inference compute. As a trivial example, you could run 10 Deep Think queries and just do consensus over them. That would be 10x the cost but would have higher performance on many benchmarks.

Most Preparedness Frameworks were developed in ~2023 before the era of effective test-time scaling. But today, there is a massive difference on the hardest evals between something like GPT-5.2 Low and GPT-5.2 Extra High. Scaffolds are also much more effective. So if you want to evaluate whether Gemini 3 can, for example, help make a bio weapon, the answer may depend on how much inference compute you give it.

In my opinion, the proper solution is to account for inference compute when measuring model capabilities. E.g., if one were to spend $1,000 on inference with a really good scaffold, what performance could be expected on a benchmark? ARC-AGI has already adopted this mindset but few other benchmarks have.

Of course, serious entities like state actors could spend well beyond $1,000. Accurate benchmark evaluations can require dozens of queries on hundreds of problems. So, if we want to measure a model's capability when using $1 million of inference, we might need to spend billions of dollars for each model release!

But in the same way that pretraining scaling laws can predict the capabilities of larger pretrained models, performance also scales somewhat cleanly with additional inference compute. In my opinion, it should become standard practice for all system cards to show plots of benchmark performance as a function of inference compute, and safety thresholds should be based on a projection of what performance would look like at $1 million+ of inference compute.

If that were the norm, then indeed releasing Deep Think probably would not result in a meaningful safety change compared to Gemini 3 Pro, other than making good scaffolds more easily available to casual users.
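
To make the projection idea in the post concrete, here is a minimal sketch in Python. It is not from the post: the function names, the sample numbers, and above all the assumption that benchmark score grows linearly in the logarithm of the inference budget are hypothetical choices made for illustration. The post only says performance "scales somewhat cleanly" with inference compute and does not specify a functional form.

```python
import math


def eval_cost(budget_per_query, queries_per_problem, num_problems):
    """Cost of measuring a benchmark directly at a given per-query budget."""
    return budget_per_query * queries_per_problem * num_problems


def fit_log_linear(budgets, scores):
    """Least-squares fit of score = a + b * log10(budget).

    ASSUMPTION: a straight line in log10(budget) is only one plausible
    curve shape; real scaling curves may saturate or bend.
    """
    xs = [math.log10(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def project_score(a, b, budget, ceiling=1.0):
    """Extrapolate the fitted line to a larger budget, capped at the ceiling."""
    return min(ceiling, a + b * math.log10(budget))


# Hypothetical measurements: benchmark accuracy at $1, $10, $100, $1,000 per query.
budgets = [1, 10, 100, 1000]
scores = [0.20, 0.30, 0.40, 0.50]

a, b = fit_log_linear(budgets, scores)
projected = project_score(a, b, 1_000_000)

# Hypothetical direct-measurement cost: 24 queries on each of 300 problems
# at $1 million per query, standing in for "dozens" and "hundreds".
direct_cost = eval_cost(1_000_000, 24, 300)
```

The design choice here is to measure cheaply at small budgets and extrapolate, rather than pay for the direct measurement: under these made-up numbers the direct run would cost `direct_cost` dollars, while the four measured points cost several orders of magnitude less. Any safety threshold based on `projected` would inherit the uncertainty of the assumed curve shape.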

