🚨Are LLMs truly ready for autonomous data science?
Real-world data is messy—missing values, outliers, inconsistencies—and if not handled properly, can lead to wrong conclusions.
🌟We introduce RADAR, a benchmark evaluating whether LLMs can handle imperfect tabular data. 🧵
Today, we’re announcing the next chapter of Terminal-Bench with two releases:
1. Harbor, a new package for running sandboxed agent rollouts at scale
2. Terminal-Bench 2.0, a harder version of Terminal-Bench with increased verification
Even as agents gain greater autonomy, it’s critical to consider human collaboration in agent design and evaluation. This work is an important step in this direction. Check it out!
Today's AI agents are optimized to complete tasks in one shot. But real-world tasks are iterative, with evolving goals that need collaboration with users.
We introduce collaborative effort scaling to evaluate how well agents work with people—not just complete tasks 🧵
True intelligence = reasoning about new information, not memorized facts.
How can we scalably create benchmarks that are completely novel yet have known answers?
Meet SynthWorlds, an eval & data-gen framework to disentangle reasoning and knowledge⬇️🧵
📄https://t.co/ITwP4YdtDG
Lastly, our framework is fully automatic 🤖!
SynthWorlds scales to new worlds and tasks beyond our initial dataset📈
Generate novel worlds, facts, and tasks to benchmark reasoning. No human labeling required.
Perfect for synthetic data and testing agentic/long-context models🏄
(please reshare) I'm recruiting multiple PhD students and Postdocs @uwcse @uwnlp
(https://t.co/I5wQsFnCLL). Focus areas incl. psychosocial AI simulation and safety, Human-AI collaboration.
PhD: https://t.co/ku40wCrpYh
Postdocs: https://t.co/K9HUIPJ5h6
This framework lets us systematically generate tasks across table sizes and imperfect table variants.
In total, RADAR includes 2,980 table–query pairs, 53 tasks, 6 artifact variants (including 1 clean), 3 column number variants, and 4 table sizes: 2K, 4K, 8K, and 16K tokens.
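To make "imperfect table variants" concrete, here is a minimal sketch of deriving such variants from one clean table. This is not RADAR's actual generation code: the toy table, the helper names (`with_missing`, `with_outlier`, `with_inconsistent_format`), and the three perturbations (taken from the missing values, outliers, and inconsistencies mentioned earlier in the thread) are all illustrative assumptions.

```python
import copy

# Hypothetical clean table as a list of row dicts (not taken from RADAR).
CLEAN = [
    {"id": 1, "height_cm": 170.0, "weight_kg": 65.0},
    {"id": 2, "height_cm": 182.0, "weight_kg": 80.0},
    {"id": 3, "height_cm": 158.0, "weight_kg": 52.0},
    {"id": 4, "height_cm": 175.0, "weight_kg": 70.0},
]


def with_missing(rows, row_idx, column):
    """Return a copy of the table with one cell blanked out."""
    out = copy.deepcopy(rows)
    out[row_idx][column] = None
    return out


def with_outlier(rows, row_idx, column, factor=100.0):
    """Return a copy with one numeric cell scaled to an implausible value."""
    out = copy.deepcopy(rows)
    out[row_idx][column] = out[row_idx][column] * factor
    return out


def with_inconsistent_format(rows, row_idx, column, unit):
    """Return a copy where one numeric cell becomes a string with a unit."""
    out = copy.deepcopy(rows)
    out[row_idx][column] = f"{out[row_idx][column]} {unit}"
    return out


def make_variants(rows):
    """Build a clean copy plus three perturbed variants (assumed set)."""
    return {
        "clean": copy.deepcopy(rows),
        "missing": with_missing(rows, 1, "weight_kg"),
        "outlier": with_outlier(rows, 2, "height_cm"),
        "inconsistent": with_inconsistent_format(rows, 0, "height_cm", "cm"),
    }


variants = make_variants(CLEAN)
```

Because every variant comes from the same clean source, a query's ground-truth answer can be computed once on the clean table and reused to check whether a model noticed and handled the injected artifact.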
💡Our results highlight key implications for deploying LLM agents: when to allocate test-time compute, how to efficiently represent data tables for detecting imperfections, and how to coordinate code execution with table inspection.
Additional results and details in the paper!