This new eval shows how you can use AI models from different providers to boost performance by 20%+ on a suite of technical problems involving logic, numeracy, geometry, calculus, statistics, and coding.
Simply put, the models can catch each other making mistakes.
1/🧵
This work is the culmination of years of effort on AI evaluation science and third-party risk assessment and disclosure mechanism design. It feels like a big milestone for METR.
We designed this new procedure with an eye toward “showing by doing” how we think evaluations for the AI loss-of-control threat model should work: laying out a process that can be done periodically, not just immediately pre-deployment, and holistically assessing risk inside of an AI lab, rather than just an individual AI system.
The exercise also involved significantly deeper access than we've previously had, including raw chains-of-thought from the developers' best models and info about private model training & control protocols.
The report is long, with a bunch of new evaluation results and documentation of our process. Please, check it out, or at least the executive summary!
"Regulating AI" is about as meaningful as saying "regulate technology" and people use this to mean all sorts of things. I think it's important, but there are some hard Qs and I want to outline them here to make it more obvious why this isn't particularly easy in practice.
The tl;dr is "AI governance should be threat-model-specific, layered, and allocated to the actors best placed to mitigate the relevant harm. Sometimes that will mean model-level duties, sometimes product or sectoral rules, sometimes user liability, and often a mix. Many laws already exist and apply; the case for new ones depends on the specific failure mode, the adequacy of existing law, and how much weight you put on the precautionary principle."
Multiplicity of actors and artifacts
1. Models are the starting point, but they are distributed to millions of users and businesses who then subsequently fine-tune, adapt, 'scaffold' them in all sorts of ways. These then become different products, which labs or cloud companies have little visibility over (for commercial, IP, and privacy reasons, amongst others).
2. The reality is that 'capabilities' come from both the models (that keep improving over time), and the wider affordances they are connected to (coding environments, tools, multi-agent scaffolds, external API calls etc).
3. There is sometimes a temptation to focus 'upstream' on models; this assumes you can deal with whatever risk you have in mind at that layer, and not have to think about it afterwards. This seems questionable both practically and normatively. This does not mean upstream governance is irrelevant, but treating it as the only or final leverage point is mistaken.
Threat/risk models matter
4. A specific governance/regulatory intervention for risk A will not necessarily be appropriate for risk B. The great majority I think are product/sector-specific. Financial risks of a high-frequency trading AI tool are better addressed by the SEC than by some generalist institution.
5. Other risks will entail a mix: e.g. some cyber-offense might be partly mitigated by model-level interventions (e.g. refusals), and partly by product-level ones (e.g. filters). More often than not, this changes over time, since much of this remains an R&D heavy area.
6. The right distribution of responsibilities is tricky and evolving a lot as the market takes shape. Generally my proxy is 'who is most proximate to the harm, and best placed to address it'. And for many risks, this depends on users too; if you ignore this, you're effectively creating a moral hazard.
7. In any case, a particular threat model needs to be agreed/defined, and there are a lot of disagreements between domain experts here (depending on the harm), sometimes due to different levels of risk tolerance, and sometimes over how to design an externally valid evaluation.
Existing laws and scoping
8. People love saying that AI operates in a wild west and this is plainly not true. Plenty of laws, regulations, tort liability, and legal precedent shape how both models and products are developed across the board. Lawyers will know this well, but generalists tend to have little visibility over this. Often this is completely ignored because it's a messy legal vortex and policymakers are incentivized to advocate for new laws.
9. Assuming you think the status quo is insufficient, the next challenge becomes scoping. Proxies like compute thresholds continue to be quite coarse and less predictive (https://t.co/S2gnjlm8qi).
10. Capabilities-based thresholds for models have the chicken and egg problem of 'how do you know which model to evaluate for these capabilities in the first place'. Hybrids are hard.
11. For products, you have the longstanding question of how to treat software, and of how to avoid polluting the digital space with the same bureaucracy that prevents anything being built in the physical world.
Evals, safeguards and mitigations
12. People often talk about evaluations as a 'mitigation' when really they are more information-surfacing mechanisms. Often I think this is very helpful and helps markets make informed choices. But they don't necessarily mitigate the underlying risk.
13. It's important to also note that mitigations (just like evals) are improving: the UK AISI notes that 'We've seen significant progress in the safeguards of certain AI systems, particularly in the biological misuse domain.' (https://t.co/GddwD9FV2C)
14. However, from a governance point of view, the items raised in bullets 1, 2, and 6 above matter here. Some capabilities diffuse quickly through open research, products, and scaffolding, even if the frontier itself remains concentrated for some period. This means that you will want safeguards to be applied by different actors, depending on risk and proximity. My own bias is to favour permissionless innovation, with some very narrow exceptions.
Specificity, grey areas, and standards
15. Even if you codify high-level obligations in law, you now have the challenge of assessing whether or not a particular entity or model is compliant. Often this is determined by looking at legal tests (when litigated in court) or through standards (e.g. ISO, FMF etc).
16. But the standards for many risks don't exist. Few people seem invested in writing them clearly, and I don't blame them: it's hard, requires industry buy-in, and the technology evolves quickly since this is an R&D heavy field.
17. I think this remains a neglected/underrated area: even at the lab layer, there's no explicit standard specifying what a good 'frontier safety policy' is, or de minimis quality requirements for an evaluation. We are seeing some good progress in recent months though, e.g. CoT legibility/monitoring as a nascent norm.
New mechanisms and institutions
18. There are interesting proposals a la 'regulatory markets for AI safety' and insurance mechanisms that I think are worth exploring and considering. I think there are promising directions but it's still early, and the devil is in the detail. I'm more unsure when this pertains to risks that are ill-defined/specified. For example there's a lot of disagreement on how to interpret 'misalignment'.
19. One thing I personally want to avoid is an ineffective rent-seeking middleware layer like the current European medical device rules which are notoriously slow, costly, unpredictable, and unnecessarily complex. There's a whole world of regulatory affairs consultancies, EU Authorized Representatives, clinical evaluation writers, QMS implementers etc that I think often produce little safety benefit relative to cost.
20. Ultimately there are different views and approaches here, I'm unsure about a lot of this, and the above isn't meant to discredit efforts on AI policy advocacy. I mostly want to unpack some of the cruxes/trade-offs for people who tend to read about 'AI governance' in more high-level and abstract pieces.
Over the past year, AI agents have learned how to self-replicate. In our test environment, an agent hacks a remote computer and copies itself onto it. Each copy then hacks more computers, forming a chain.
Join us for a week-long writers retreat at the lovely @Casa_Tilo from Aug 29th - Sep 5th -- one post a day for 5 days! Hosted by myself and @chiaragerosa :)
Ever wanted to do nothing but write for a week? Well, today’s your lucky day! @yearningslav & I are hosting a writers retreat at the beautiful @Casa_Tilo Barcelona from Aug 29th - Sep 5th. Link below!
...@MaximeStauffer & @jpsnoeij who'll take me on a surprise adventure every other month (the only clue i have for the next one is 'sweat')
...and another from my wonderful friend @zadig_1 who illustrated a poem i wrote when i was 15 and turned it into a children's book for me
Excited about our new paper: AI Agent Traps
AI agents inherit every vulnerability of the LLMs they're built on - but their autonomy, persistence, and access to tools create an entirely new attack surface: the information environment itself.
The web pages, emails, APIs, and databases agents interact with can all be weaponised against them. We introduce a taxonomy of six classes of adversarial threats - from prompt injections hidden in web pages to systemic attacks on multi-agent networks.
I’m outlining the six categories of traps in the thread below
I'm spending more cognitive effort than I'd like parsing documents that are clearly 'lazily prompted low-effort AI outputs with some plausible deniability formatting cleanups' rather than 'AI-assisted, filtered, and finely crafted for a bespoke purpose and audience'.
- deeply understanding the female experience (biology, emotions, reactions)
- leaning into healthy masculinity; exploring femininity; playing with both confidently and attractively
- appreciation for the complexity of human relationships, not over-simplifying them
- sitting with tension in relationships, not needing to fix things immediately
- attention to detail & beauty in physical space
- a deeper appreciation for gift-giving
This weekend I made a game from scratch to play with my 12-year-old daughter.
If you are a dad, you know how popular multiplayer games are with tweens.
With Claude, I was able to whip up a game in two days & collaborate with my daughter on features & gameplay.
Amazing family experience!