(🧵) Happy to release AIRS-Bench, a benchmark to test the autonomous machine learning abilities of AI research agents 🤖
AIRS-Bench includes 20 tasks sourced from machine learning papers that assess the autonomous research abilities of LLM agents throughout the full research lifecycle, from hypothesis generation 💡 and implementation 🛠️ to experimentation 🧪 and analysis
Each task is extracted from a paper with a state-of-the-art result and consists of:
a problem description (e.g. text similarity),
a dataset (e.g. SICK), and
a metric (e.g. Spearman correlation) to optimise
The agent is then given a GPU and 24 hours to develop and submit a Python solution that matches or exceeds the paper's SOTA
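For readers who want something concrete, here is a minimal sketch of how a task triple (problem, dataset, metric) and the "match or exceed SOTA" check could be expressed. The `Task` class, its field names, and the `sota=0.90` value are hypothetical placeholders, not the AIRS-Bench schema or a number from any paper, and the sketch assumes a metric where higher is better.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


@dataclass
class Task:  # hypothetical container, not the benchmark's real format
    problem: str
    dataset: str
    metric: Callable[[Sequence[float], Sequence[float]], float]
    sota: float  # score reported by the source paper

    def beats_sota(self, gold: Sequence[float], predicted: Sequence[float]) -> bool:
        # Assumes higher metric values are better.
        return self.metric(gold, predicted) >= self.sota


# Placeholder numbers for illustration only.
task = Task("text similarity", "SICK", spearman, sota=0.90)
gold = [1.0, 2.5, 3.0, 4.8]
predicted = [1.2, 2.0, 3.5, 4.0]
print(task.beats_sota(gold, predicted))
```

Spearman only compares rank orderings, so a solution is rewarded for ordering sentence pairs correctly even when its raw similarity scores are on a different scale from the gold labels.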
Read on for baseline results and examples of agents surpassing human SOTA
We open-source the AIRS-Bench task definitions and evaluation code to accelerate progress in autonomous scientific research:
💻 GitHub: https://t.co/UXzNXyGdU5
ArXiv: https://t.co/badN0jq0IA
🤗 HF paper: https://t.co/6FIWxF0Bsw
Meta AI website: https://t.co/wcIWLrlYBU
Huge shoutout to the team from Meta FAIR who painstakingly crafted, debugged and inspected every single one of these tasks and its runs across more than a dozen agents: @alisia_lupidi, @_tomwithanh, @BhavulGauri, @basselralomari, @albertomariape, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, @LuciaCKun, @GagnonAudet, Chee Hau Leow, Sandra Lefdal, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, @mahnerak, @ishitamed, @EdanToledo and @rybolos, @alex_h_miller, @j_foerst, @yorambac for their leadership and support
Hello world :)
We are BOLD, the British Open-ended Learning and Discovery Lab!
BOLD is a new academic research lab fully focussed on paradigm-breaking discoveries in fundamental AI. We work towards more efficient & open AI that is built around human needs and capabilities.
To pursue these breakthroughs, we pioneer new modes of collaboration in academia that are more focussed, resourced, agile, and collaborative. Rather than fragmenting resources, today we are sunsetting 5 of the UK's leading AI labs to join forces under our joint scientific vision.
Our vision is centered around three pillars:
⚡ Beyond backpropagation: questioning the foundations of the field.
🤝 Human-centric learning & discovery: treating humans as core to our algorithms
🤖 Embodied learning: fast learning and adapting methods that deal with the messy real world
BOLD is backed by @UKRI_News and @EPSRC with £30M, and this is just the beginning. We are urgently looking for partners and sponsors to 10x this.
https://t.co/eFVFW31mqz
https://t.co/Eoad4G18KL
@j_foerst, @CULLYAntoine, @tonizza82, @shimon8282, Ani Calinescu & @_rockt
Hey everyone, big day for us at Skiplabs: Skipper Beta is live
Skipper is a closed-loop coding agent. Instead of constantly going back and forth with the AI, you give it a spec and it iterates internally until it produces a working software service.
We believe this is where AI-assisted coding is heading, and we're excited to finally share what we've been working on behind the scenes.
Start building with Skipper: https://t.co/XYooJgn0hp
Presented at #ASCO26:
Among patients with previously treated metastatic pancreatic ductal adenocarcinoma, the RAS(ON) inhibitor daraxonrasib led to significantly longer overall survival and progression-free survival than chemotherapy. Full phase 3 RASolute 302 trial results: https://t.co/xwLWBZYRzq
@ASCO
Cheers, chills, and a standing ovation when RASolute 302 showed unprecedented survival on daraxonrasib for patients with progressive pancreatic cancer
Seldom do you sense you're witnessing a historic moment in cancer care, but this feels like RAS targeting has arrived
#ASCO26
How can we help AI scientists train up their own LLM engine? I'm pleased to share our work on AI Research Agents discovering novel language modeling architectures, showing competitive performance when scaled up at the 1B parameter size: https://t.co/8oech1DPjQ
(🧵) Excited to share our latest work on AI Research Agents discovering novel language modelling architectures that show competitive performance when scaled up at 1B parameter size: https://t.co/IbW4LwMwu4
🤖 We gauge the ability of AI systems to autonomously design foundation models beyond the standard Transformer paradigm, by empowering LLM agents to perform both
high-level architecture search: AIRA-Compose
🛠️ low-level mechanistic implementation: AIRA-Design
(8/🧵) In the https://t.co/fdQry3zdli task, agent-optimised pretraining code achieves 0.968 validation bits-per-byte, surpassing the published minimum reference; experiments with select papers inserted into the context show that external literature occasionally helps
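For context on the metric, here is a minimal sketch of how bits-per-byte is commonly computed from a language model's summed cross-entropy. The thread does not describe the task's actual evaluation code, so the function name and inputs below are assumptions made for illustration.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) over `text`
    into bits per UTF-8 byte. Lower is better."""
    n_bytes = len(text.encode("utf-8"))
    # Divide by ln(2) to convert nats to bits, then normalise by byte count.
    return total_nll_nats / (math.log(2) * n_bytes)


# A model that spends exactly one bit per byte on a 5-byte string.
print(bits_per_byte(5 * math.log(2), "hello"))
```

Normalising by bytes rather than tokens keeps scores comparable across models that use different tokenizers.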
🤖 Can AI research agents discover the next generation of foundation models? We put them to the test with two complementary, model-agnostic frameworks: AIRA-Compose and AIRA-Design -- a thread:
Worried about Anthropic's Mythos? Fully formally verified code generation is the defense.
Combining Lean, frontier models, multi-agent scaffolds, and inference scaling, we show <12mo benchmarks jumping from 20% to 70%.
Real-world verification is here.
https://t.co/ADGXmJOLlZ
1/
Happy to see AIRS-Bench, an AI R&D benchmark that Meta open-sourced earlier this year (https://t.co/ttZYThznEl), being used in the Muse Spark Safety & Preparedness Report to assess loss of control risks stemming from acceleration of AI development.
AIRS-Bench (https://t.co/UXzNXyFG4x) measures the ability of AI agents to execute end-to-end AI R&D across the full research lifecycle, from idea generation 💡 and implementation 🛠️ to experiment analysis 🧪 and iterative refinement
Along with SWE-Bench and MLE-Bench, AIRS-Bench was used to assess the risks of models automating AI R&D work and outpacing governance mechanisms. Our findings suggest that Muse Spark does not substantially contribute to this threat, as it achieves performance superior to human researchers in only 5 out of 20 tasks, and only in a fraction of its attempts
This is in line with results from comparison models and highlights the models' limitations in executing the complete research lifecycle consistently and across a wide range of domains
Head over to the 158-page report for more detailed results and a wide range of assessments and mitigations under Meta's Advanced AI Scaling Framework
Muse Spark Safety & Preparedness Report for Meta AI is out.
We start with our pre-deployment assessment under Meta's Advanced AI Scaling Framework, covering chemical and biological, cybersecurity, and loss of control risks. Our assessment flagged potentially elevated chem/bio risk, so we implemented safeguards and validated mitigations before deployment - bringing residual risk to within acceptable levels.
Beyond the Framework, we also share findings and early explorations of model behavior (honesty, intent understanding, etc.), jailbreak robustness, eval awareness, and more.
We're sharing this report to give a closer look at how we evaluate advanced AI safety. Always more work to do, and we welcome feedback from the community.
https://t.co/azpKHwu7x9
Excited to share AIRA₂, our next-generation AI Research Agents for ML that address key bottlenecks to scaling.
AIRA₂ achieves SoTA on real-world ML tasks from MLE-bench-30 (81.5% vs 72.7%), exceeds human SoTA on 6/20 diverse AI research tasks from AIRS-Bench (and hacks another 5), while exhibiting strong, predictable scaling properties.
To push the frontier of AI Research, we need systems that scale well. Developing AIRA₂, we learned a lot about the bottlenecks and what it takes to resolve them: insights already driving our next iteration:
1/