unbeatable pattern: set up benchmarks based on direct user needs (i.e. my agent should do this for the user) and use coding agents to hillclimb them by evolving the prompt/scaffolding
Every company building on top of AI should be making their own benchmarks.
This is the way if you want model progress to disproportionately benefit your company.
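A minimal sketch of that loop, under loud assumptions: `toy_agent` is a stand-in for a real model call, `BENCHMARK` is a made-up set of user-need cases, and `CANDIDATE_EDITS` stands in for whatever prompt/scaffolding changes a coding agent would propose. None of these names come from a real library. It only shows the shape of the idea: score the prompt on your own benchmark, and keep an edit only if the score goes up.

```python
from typing import List, Tuple

Case = Tuple[str, str]  # (user input, expected agent output)

# Hypothetical benchmark derived from a direct user need:
# "my agent should return the cleaned, uppercased text".
BENCHMARK: List[Case] = [
    ("  hello ", "HELLO"),
    ("world", "WORLD"),
    (" mixed Case  ", "MIXED CASE"),
]

# Hypothetical prompt edits; in practice a coding agent would propose these.
CANDIDATE_EDITS = ["Trim whitespace.", "Answer in uppercase.", "Be polite."]


def toy_agent(prompt: str, user_input: str) -> str:
    """Stand-in for a real model call; behavior depends only on the prompt."""
    out = user_input
    if "Trim whitespace." in prompt:
        out = out.strip()
    if "Answer in uppercase." in prompt:
        out = out.upper()
    return out


def score(prompt: str) -> float:
    """Fraction of benchmark cases the agent gets exactly right."""
    passed = sum(toy_agent(prompt, x) == y for x, y in BENCHMARK)
    return passed / len(BENCHMARK)


def hillclimb(prompt: str, max_rounds: int = 10) -> Tuple[str, float]:
    """Greedy hill climbing: accept an edit only if the score strictly improves."""
    best, best_score = prompt, score(prompt)
    for _ in range(max_rounds):
        improved = False
        for edit in CANDIDATE_EDITS:
            if edit in best:
                continue  # already part of the prompt
            candidate = f"{best} {edit}"
            s = score(candidate)
            if s > best_score:
                best, best_score, improved = candidate, s, True
        if not improved:
            break  # local optimum on this benchmark
    return best, best_score


if __name__ == "__main__":
    final_prompt, final_score = hillclimb("You are a helpful agent.")
    print(final_prompt, final_score)
```

The edit that doesn't move the benchmark ("Be polite.") never gets accepted, which is the point: the benchmark, not taste, decides what stays in the prompt.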
We are hiring a bunch of Members of the Technical Staff for @GoogleAIStudio who can blend PM, design, eng, and more
If this is you, pls DM me, we will move fast for the best people.
good example of asymmetric cyber offense that will get worse with ai. source code analysis wouldn’t have found this. if they’d taken some more steps, eg introduced into upstream packages or over multiple commits thru ci, this would’ve gone undetected longer.
lots of ai on offense will probably be able to overpower lots of ai on defense, in the medium term. combinatorial surface area to defend. and now the dark forest has infinitely more creatures lurking.
@felixrieseberg It'll be tough for Anthropic to keep this lead though. The winning model will be a really fast one, and good at multi-agent communication (for scaling both context and instances), and one with the most compute infra. The other labs have structural advantages here.
wow. maybe one of the best products ever launched, and the start of the great hardware unbundling
soon, we'll have armies of computer users w persistent storage. we'll have organizations of long-running coworkers. we'll be able to talk to them while we sit by a campfire
Today, we’re releasing a feature that allows Claude to control your computer: Mouse, keyboard, and screen, giving it the ability to use any app.
I believe this is especially useful if used with Dispatch, which allows you to remotely control Claude on your computer while you’re away.
@felixrieseberg I do think the implications for devices are pretty massive. If you can have an agent that uses all your apps for you remotely, there may not be a reason for an iPhone anymore. And devices can come in any form, any interface.
One counterbalance will be that some advances from automated research will be rapidly diffused - if Lab A discovers a new algorithm, it probably will leak to Lab B rather quickly. But this doesn’t fully counteract these dynamics, since many advances will be codesigned for each lab’s infra, model particularities, and individual strategies, and thus won’t be easily transferable.
Automated AI research will lead to new scaling laws.
When you have 10x the compute, and 10x the number of automated researchers/engineers, I’d bet returns will be steeper than for pretraining and RL, given the parallelizability of AI research and compounding.
While compute advantages thus far have come out in the wash (having 2x the compute hasn’t proved much more important, as Dario observes in the Dwarkesh episode), a sharper curve likely means more monopolistic market dynamics.