Huge congrats to @Humana's Erius agent on taking the #1 spot on CHI-Bench for Prior Auth and 6th across all domains. It outperforms every frontier lab on one of healthcare's hardest workflows.
The CHI-Bench leaderboard was just updated with the newest and highest score from @claudeai Opus 4.8.
CHI-Bench is the world's first long-horizon benchmark for healthcare AI agents.
Leaderboard: https://t.co/wjd9wK44eU
CHI-Bench is the world's 1st long-horizon healthcare benchmark for AI agents.
If you're building or buying AI for healthcare, this is the test that actually matters — real clinical workflows, not toy demos.
U.S. healthcare needs this. 🏥🔬
actAVA AI integrates CHI-Bench with @huggingface and @harborframework today.
Users can run the CHI-Bench evaluation and RL training from both platforms.
🚨 Historic moment for @actAVAai! Just one day after launch, our benchmark dataset is already the #10 most popular on Hugging Face, out of 1 million+ datasets! Huge thanks to @iscreamnearby, @HaolinChen11, Deon Metelski, Leon Qi, Tao Xia, Joon Lee, Steve Brown, Kevin Riley, T. Y. Alvin Liu, M.D., Zhiwei Liu, Qingsong Wen, @CaimingXiong, Sanmi Koyejo, Eric Xing & all our collaborators.
A new 33-author benchmark called CHI-Bench finds that the best AI agent configuration resolves only 28% of realistic healthcare administration tasks, dropping to 3.8% in continuous-session testing.
(1/n) After a few months of work with multiple hospitals, universities and research facilities, today we're open-sourcing CHI-Bench: the first long-horizon benchmark for healthcare AI agents on real clinical and healthcare workflows.
Best frontier agent overall: 28% pass@1.
End-to-end prior authorization: 0%.
A thread on what we found 🧵
1/ Introducing CHI-Bench 🧵
Can AI agents automate U.S. healthcare workflows end to end — given only clinician & insurer apps, operations, and a medical policy library?
75 long-horizon workflows × 30 frontier agents. Best agent solves just 28%.
#AIinHealthcare 👇
Proud to have helped build CHI-Bench 🧵
Can frontier agents run U.S. healthcare workflows end to end? 75 long-horizon tasks, 30 agents — best solves just 28%. We're early, and now we can measure it.
Fully open 👇
1/🧵 Can AI agents automate U.S. healthcare workflows end to end, given just clinician & insurer apps, operations, and a medical policy library? Introducing CHI-Bench: 75 realistic long-horizon healthcare workflows × 30 frontier agents. The best agent solves only 28%. #AIinHealthcare 👇