Perhaps a 🌶️ take but I think the criticisms of @GoogleDeepMind's release are missing the point, and the real problem is that AI labs and safety orgs need to adapt to a world where intelligence is a function of inference compute.
When Google says that Deep Think poses no new risks beyond Gemini 3 Pro, they probably mean that Deep Think is a scaffold of Gemini 3 Pro that anyone externally could have constructed on their own anyway. In other words, the capabilities of Deep Think have always been available to anyone willing to pay for Deep Think amounts of inference, simply by scaffolding a bunch of Gemini 3 Pro queries together. Deep Think just makes that more convenient for the casual user.
The corollary of this is that capabilities far beyond Gemini 3 Deep Think are already available to anyone willing to scaffold a system together that uses even more inference compute. As a trivial example, you could run 10 Deep Think queries and just do consensus over them. That would be 10x the cost but would have higher performance on many benchmarks.
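To make the consensus example concrete, here is a minimal sketch of majority voting over repeated queries. `query_model` is a hypothetical stand-in for whatever client you use to call the model (not a real API), and exact string matching on answers is a simplifying assumption.

```python
from collections import Counter
from typing import Callable


def consensus_answer(
    query_model: Callable[[str], str],  # hypothetical: sends a prompt, returns one answer
    prompt: str,
    n_samples: int = 10,
) -> str:
    """Query the model n_samples times and return the most common answer."""
    # Each call is an independent sample, so total cost scales linearly with n_samples.
    answers = [query_model(prompt).strip() for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)] for the most frequent answer.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

This only works as-is when answers can be compared directly (e.g. a final number or a multiple-choice letter). Free-form outputs would need some other way to aggregate.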
Most Preparedness Frameworks were developed in ~2023 before the era of effective test-time scaling. But today, there is a massive difference on the hardest evals between something like GPT-5.2 Low and GPT-5.2 Extra High. Scaffolds are also much more effective. So if you want to evaluate whether Gemini 3 can, for example, help make a bioweapon, the answer may depend on how much inference compute you give it.
In my opinion, the proper solution is to account for inference compute when measuring model capabilities. E.g., if one were to spend $1,000 on inference with a really good scaffold, what performance could be expected on a benchmark? ARC-AGI has already adopted this mindset but few other benchmarks have.
Of course, serious entities like state actors could spend well beyond $1,000. Accurate benchmark evaluations can require dozens of queries on hundreds of problems. So, if we want to measure a model's capability when using $1 million of inference, we might need to spend billions of dollars for each model release!
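A back-of-the-envelope version of that cost estimate, as a rough sketch. The specific numbers (24 queries per problem, 300 problems) are illustrative assumptions standing in for "dozens" and "hundreds", and it assumes every query gets the full $1 million budget.

```python
def eval_cost_usd(budget_per_query_usd: float, queries_per_problem: int, num_problems: int) -> float:
    """Total cost of one benchmark run where every query gets the full inference budget."""
    return budget_per_query_usd * queries_per_problem * num_problems


# Assumed values: $1M per query, 24 queries per problem, 300 problems.
total = eval_cost_usd(1_000_000, queries_per_problem=24, num_problems=300)
print(f"${total:,.0f}")  # $7,200,000,000
```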
But in the same way that pretraining scaling laws can predict the capabilities of larger pretrained models, performance also scales somewhat cleanly with additional inference compute. In my opinion, it should become standard practice for all system cards to show plots of benchmark performance as a function of inference compute, and safety thresholds should be based on a projection of what performance would look like at $1 million+ of inference compute.
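One way such a projection could work, as a sketch only: fit benchmark score against the log of inference spend at affordable budgets, then extrapolate. The log-linear form is my assumption here, not an established law, and the data points in the usage example are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(costs_usd: Sequence[float], scores: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of score = slope * log10(cost) + intercept."""
    xs = [math.log10(c) for c in costs_usd]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def project_score(slope: float, intercept: float, cost_usd: float) -> float:
    """Extrapolate the fitted line to a larger budget, clamped to the valid range [0, 1]."""
    return min(1.0, max(0.0, slope * math.log10(cost_usd) + intercept))


# Hypothetical measurements at $1, $10, $100, and $1,000 of inference per problem.
slope, intercept = fit_log_linear([1, 10, 100, 1000], [0.10, 0.20, 0.30, 0.40])
projected = project_score(slope, intercept, 1_000_000)
```

A straight line in log-cost will not hold forever (scores saturate, and some problems may stay out of reach at any budget), so a real system card would want to show the measured points next to the fit and say how far out the extrapolation goes.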
If that were the norm, then indeed releasing Deep Think probably would not result in a meaningful safety change compared to Gemini 3 Pro, other than making good scaffolds more easily available to casual users.
We've raised over $400M at a $10.2B post-money valuation to advance the frontier of AI coding agents.
The round was led by Founders Fund with other existing investors including Lux, 8VC, Neo, Elad Gil, Definition Capital, and Swish VC all doubling down. We're also joined by new investors including Bain Capital Ventures and D1 Capital.
Two of our early investors, Christian Lawless of Conversion Capital and Emily Cohen of Neo, have even joined our team full-time.
Thinking about other areas where you can encode heuristics: CX, ITSM, HR, SecOps, supply chain / logistics, accounting, marketing, legal / compliance, insurance claims processing, education, travel / hospitality, healthcare admin / back office… what am I missing? What are areas that are harder to encode - maybe clinical decision making?
Have been thinking about how LLMs are really just encoded heuristics. Heuristics are simply techniques that are based on any sort of practical knowledge and/or experience - i.e. mental shortcuts that can help with problem solving. Language (obviously) is the best example of this (chunking / predictive / familiarity heuristics). Another good example of this is driving (distance / speed-matching / right-of-way heuristics).
The overarching opportunity in AI is as broad as anything that can be encoded as a heuristic. Not saying this is a perfect science, but the best way to frame this is using heuristics to mimic a process / workflow by drawing relevant info, making a decision, then performing an action. The application of this is infinite (especially wrt services). Models that are built / fine-tuned for more specific use cases using heuristics -> immediate ROI. Easy to picture how this could be applied to a co like $CRM (e.g. automate prospecting, qualifying, scheduling calls, post-meeting debriefs, etc).
We're hosting a happy hour at ICML next week in Vienna! Serving data martinis, shaken not stirred.
Sign up here: https://t.co/KP2x3KJEKC
@rak_garg @brittwalker_