Top Tweets for #AutoMedBench

25 days ago

🏆 Think your AI agent can do basic medical research end-to-end? Submit to our leaderboard (https://t.co/tXO9ioUPiN) and find out!

We're launching #AutoMedBench, the first benchmark for evaluating Medical #AutoResearch agents across the entire research workflow, not just the final answer.

📄 Paper: https://t.co/X2G66j2Qx6
🌐 Project: https://t.co/bYKEVcfqEU
💻 Code: https://t.co/z37BfBBNm5

📝 Plan → ⚙️ Setup → 🔍 Validate → 🚀 Inference → 📦 Submit

As AI agents move from answering medical questions to conducting end-to-end medical AI research, we need to measure where they succeed and where they break.

What we benchmark:
• 24 tasks across segmentation, image enhancement, VQA, report generation, and lesion detection
• 48 task-tier combinations spanning Lite and Standard settings
• 6 frontier AI agents under a unified interface
• Thousands of runs with detailed logs of stage performance, costs, tokens, wall time, and failure modes

📊 Current leaderboard:
🥇 #Opus 4.6: 66.5
🥈 #GLM-5: 61.6
🥉 #Gemini 3.1 Pro: 59.0
4️⃣ #ChatGPT-5.4: 55.3
5️⃣ #MiniMax-M2.5: 51.6
6️⃣ #Qwen3.5: 51.2

🔎 Key findings:
⚠️ Agents are better at completing workflows than producing high-quality scientific outputs.
⚠️ Validation is the weakest stage; Setup is the strongest.
⚠️ More scaffolding is not always better: some frontier agents actually perform worse with additional guidance.
⚠️ The dominant failures are verification and submission, not task understanding.

💡 Takeaway: The next frontier for research agents isn't just more medical knowledge. It's better workflow control, validation, error recovery, and artifact-level reasoning.

#AIAgents #MedicalAI #AgenticAI #LLM #MultimodalAI #HealthcareAI #Benchmark
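The leaderboard quoted in this tweet can be tabulated to see how far each agent trails the leader. This is a minimal Python sketch that uses only the six scores stated above; the variable names are illustrative and are not taken from the AutoMedBench code.

```python
# Overall scores exactly as quoted in the tweet.
scores = {
    "Opus 4.6": 66.5,
    "GLM-5": 61.6,
    "Gemini 3.1 Pro": 59.0,
    "ChatGPT-5.4": 55.3,
    "MiniMax-M2.5": 51.6,
    "Qwen3.5": 51.2,
}

# Sort agents from highest to lowest overall score.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
leader_score = ranked[0][1]

# Print each agent with its gap to the leader, rounded to one decimal.
for rank, (agent, score) in enumerate(ranked, start=1):
    gap = round(leader_score - score, 1)
    print(f"{rank}. {agent}: {score} (gap to leader: {gap})")
```

On these numbers the spread between first and last place is 15.3 points, and the bottom two agents are separated by 0.4.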

Yuyin Zhou

@yuyinzhou_cs

28 days ago

🩺 Can AI agents conduct medical research end-to-end, just like human researchers?

Introducing AutoMedBench, the first workflow-aware benchmark for medical AutoResearch agents. 🧪

📄 Paper: https://t.co/2H666zmPov
🌐 Project: https://t.co/bYKEVcfqEU
💻 Code: https://t.co/z37BfBBNm5

Medical agents are rapidly evolving from answering questions to conducting end-to-end medical-AI research: loading datasets, building pipelines, debugging failures, running inference, and submitting results.

Yet most benchmarks only evaluate the final answer. A good score can hide a broken process. A failed run often reveals nothing about where the agent went wrong.

🔬 AutoMedBench evaluates the entire workflow.

Every run is decomposed into 5 stages:

📝 Plan → ⚙️ Setup → 🔍 Validate → 🚀 Inference → 📦 Submit

Instead of a single score, AutoMedBench diagnoses which stage succeeds or fails, and why.

📊 Benchmark scope
• 5 medical AI tracks: Segmentation, Enhancement, VQA, Report Generation, and Lesion Detection
• Lite & Standard tiers (same data/metrics, different scaffolding)
• Long-horizon tasks averaging ~33 agent turns
• Full logs of actions, tokens, runtime, cost, and error codes

We put today's frontier agents to the test:

🏆 Overall leaderboard
🥇 #Opus 4.6: 66.5
🥈 #GLM-5: 61.6
🥉 #Gemini 3.1 Pro: 59.0
4️⃣ #ChatGPT-5.4: 55.3
5️⃣ #MiniMax-M2.5: 51.6
6️⃣ #Qwen3.5: 51.2

But no single model dominates everything: GLM-5 leads VQA, while Opus 4.6 leads most other tracks.

⚠️ The key finding
Across thousands of runs, Validate is the weakest stage while Setup is the strongest. Today's agents are much better at making a pipeline run than ensuring it is correct before large-scale inference.

📉 The bottleneck isn't medical knowledge
🔍 Verification & recovery errors: 37.7%
📦 Deliverable & submission errors: 38.1%
🧠 Task-understanding errors: only 0.9%

Even a single fired error code reduces the overall score by 48%.

The next frontier for medical AI agents is not more knowledge. It's workflow reliability, verification, and self-correction.

#AutoResearch #MedicalAI #AIAgents #HealthcareAI #AgenticAI
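To make the stage-wise view concrete, here is a minimal Python sketch of how a run split into the five named stages could be checked for its first failing stage. Only the stage names come from the tweet. The `Stage` enum, the `first_failed_stage` helper, and the pass/fail dictionary are hypothetical illustrations, not AutoMedBench's actual data model or scoring.

```python
from enum import Enum
from typing import Dict, Optional


class Stage(Enum):
    # Stage names as listed in the tweet; the numeric values are arbitrary.
    PLAN = 1
    SETUP = 2
    VALIDATE = 3
    INFERENCE = 4
    SUBMIT = 5


def first_failed_stage(passed: Dict[Stage, bool]) -> Optional[Stage]:
    """Return the earliest stage that did not pass, or None if all passed.

    Hypothetical helper: a stage missing from `passed` is treated as failed.
    """
    for stage in Stage:  # Enum members iterate in definition order.
        if not passed.get(stage, False):
            return stage
    return None


# Made-up example run: the pipeline executes, but validation was not passed.
run = {
    Stage.PLAN: True,
    Stage.SETUP: True,
    Stage.VALIDATE: False,
    Stage.INFERENCE: True,
    Stage.SUBMIT: True,
}

failed = first_failed_stage(run)
print(failed.name if failed else "all stages passed")
```

A run like this one would look fine if only the final submission were inspected, which is the situation the tweet describes as a good score hiding a broken process.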


