We made a chart of 44 documented incidents of AI agents acting against user intent – sometimes subverting routine security and deceptively hiding evidence of their actions.
Yeah, open sourcing an alignment recipe could be good. If I'm taking a more optimistic view, it's plausible that loss-of-control risk is largely concentrated in the frontier of AI capabilities, as opposed to the open-source models lagging behind. It's also plausible that, to the extent some models' capabilities rely on distillation from other labs, their progress will slow down a lot if the labs they're distilling from slow down. I'm not sure though. I would like to see more detailed threat modeling.
@mattsheehan88 Putting aside whether China's testing requirements are good, they rely on fixed question-answer evals which are so easy to cheese. The resulting models don't have to be (and empirically aren't) adversarially robust.
@RuxandraTeslo "a city that was essentially a giant consumption machine that produced nothing" this is overclaiming. Edo's commoners were half the city and produced plenty.
President Trump's executive order today takes several steps to secure America against AI-enabled cyber threats: hardening government and critical infrastructure, collaborating voluntarily with the AI industry to identify and patch vulnerabilities, and going after AI-enabled cybercrime.
I'm honored to be one of the few Americans chosen for the AI Scientific Panel. I'm excited to contribute technical expertise here and help make sure U.S. perspectives are represented. AI policy for the most capable models can be more thoughtful when there's pragmatic, independent analysis to inform it.
@deanwball perhaps one day they may transcend the distinction between nouns, verbs, adjectives, and adverbs altogether, just as everything is a verb in Lojban
Limitations of report: This report isn’t robust oversight of frontier AI developers by itself. METR has some levers to incentivise companies’ participation, including some relevant legislation, but ultimately participants could have pulled out at any time if the result would be contrary to their interests.
You can view it partly as a pilot exercise of what regulation (or formalized industry standards) could/should require, or what partners/suppliers/customers/employees should demand from frontier developers.
Quoting from the report: “METR’s work relies on developing and maintaining strong working relationships with companies, and this impacted both how we designed the process for this pilot (e.g. offering the silent exit option) and lower-level judgment calls as the process unfolded (e.g. having a relatively high bar for what redactions we pushed back on). In some cases we refrained from making an unflattering claim because the claim was neither solidly defensible nor particularly relevant to our core assessment. We also made efforts not to invite salient comparisons between companies on capabilities or safety.”
It doesn’t feel to me like this distorted our overall conclusions too much in this case. But that was partly because the conclusions weren’t that spicy. If our conclusions reflected very negatively on AI developers or would directly lead to e.g. govt intervention or public outcry, we’d be in a difficult position. We’d be trying to balance keeping the companies happy enough that they didn’t pull out of the program (using the “no-fault exit” mechanism) vs being transparent about our conclusions.
We clearly need more robust mechanisms than this for providing accountability for AI developers.
At least for loss-of-control risk, the right timing is periodic evals of the most capable internal models, rather than evaluating a model right before public release.