New AstaBench results show frontier models making progress on scientific research, but the benchmark remains far from solved.
Claude Opus 4.7 leads overall at 58.0%, while GPT-5.5 comes within 5.1 points at less than half the measured cost per problem. 🧵
We built MolmoWeb from scratch with Molmo2!!! 💕🌐
It’s not easy to build SOTA web agents out of open source VLMs, when they can be so profitable that very few projects release everything (if anything), esp the datasets 🔑
But we just released all the MolmoWeb model checkpoints and datasets from Ai2 😉
Can’t wait to see what the community builds on top of MolmoWeb!🫡
🔎 Deep research agents like Asta ScholarQA and OpenAI Deep Research are transforming how we perform literature review.
But how do we know if the way we evaluate them is actually meaningful?
Announcing our new paper: “Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks” 🧵
Are you a researcher in CS or a CS-adjacent field curious about how an AI agent can help you with your research project? Want to try a new research support tool in a paid user study ($100, 2 hr)? Spots are limited. See details and sign up here: https://t.co/lAhe3zNUK1
Can AI predict what scientists will do next—not just one piece, but the whole research process? PreScience is our new model eval for forecasting how science unfolds end-to-end, from how research teams form to a paper's eventual impact. Built with @UChicago, supported by @NSF.
We’re releasing the Theorizer code and framework + a dataset of ~3,000 theories generated by Theorizer across the field of AI/NLP, built from 13,744 source papers.
💻 Code: https://t.co/C5zr2Nm9c7
📝 Technical report: https://t.co/3LUiDkXyvc
✍️ Learn more in our blog: https://t.co/OkCG3LCqtE
I'm so excited by this! Our system is generating some insightful & novel theories (e.g., internally for LM post-training). And it's still getting better!
Introducing Theorizer: Turning thousands of papers into scientific laws 📚➡️📜
Most automated discovery systems focus on experimentation. Theorizer tackles the other half of science: theory building—compressing scattered findings into structured, testable claims. 🧵
Introducing Ai2 Open Coding Agents—starting with SERA, our first-ever coding models. Fast, accessible agents (8B–32B) that adapt to any repo, including private codebases. Train a powerful specialized agent for as little as ~$400, & it works with Claude Code out of the box. 🧵
Smart analysis of scholarly output when authors adopted LLMs as part of their writing: 1) huge 36% boost in # papers published 2) LLMs mitigate skill disparities, eg native language - enough to shift market share of production toward China https://t.co/GPaak6dguv @yian_yin
🆕 New in Asta: multi-turn report generation.
You can now have back-and-forth conversations with Asta, our agentic platform for scientific research, to refine long-form, fully cited reports instead of relying on single-shot prompts.
🧠 Introducing NeuroDiscoveryBench. Built with @AllenInstitute, it’s the first benchmark for evaluating AI systems like our Asta DataVoyager agent on neuroscience data. The benchmark tests whether AI can truly extract insights from complex brain datasets.
#NeurIPS2025 and AI x Science?
Some fun announcements are coming up. Stay tuned.
Also, our Asta internship application is still open -- apply and mention my name if you'd like to work w me ~
Announcing Olmo 3, a leading fully open LM suite built for reasoning, chat, & tool use, and an open model flow—not just the final weights, but the entire training journey.
Best fully open 32B reasoning model & best 32B base model. 🧵
Plenty of AI-gen papers in ICLR. Wonder why?
🚨 In a preregistered Randomized Controlled Trial, we find: CS authors perceive AI-abstracts as more readable and tend to edit them less than their published counterparts. AI-use and its disclosure shape the fabric of collaborative scientific writing.
Work led by @hsanchaita & @leadoeun27, advised by @shocheen & yours truly.
1/n
🔥Thrilled to introduce DR Tulu-8B, an open long-form Deep Research model that matches OpenAI DR 💪Yes, just 8B! 🚀
The secret? We present Reinforcement Learning with Evolving Rubrics (RLER) for long-form non-verifiable DR tasks! Our rubrics:
- co-evolve with the policy model
- are grounded on search knowledge
🧵
Agent benchmarks don't measure true *AI* advances
We built one that's hard & trustworthy
👉AstaBench tests agents w/ *standardized tools* on 2400+ scientific research problems
👉SOTA results across 22 agent *classes*
👉AgentBaselines agents suite
🆕https://t.co/BFjdGCAp1w
🧵👇
Super interesting and well written summary of the incredible progress we’ve made on climate change (and what’s most important to do next) ⭐️⭐️⭐️⭐️⭐️ https://t.co/hSensrfSdY
📊 Today we're releasing data showing which scientific papers our AI research tool Asta cites most frequently. Think of it as creating citation counts for the AI era—tracking which research is actually powering AI answers across thousands of queries. 🧵
Introducing Asta DataVoyager—our new AI capability in Asta that turns structured data into transparent, reproducible insights. Built for scientists, grounded in open, inspectable workflows. 🧵