Top Tweets for #AutoMedBench
🏆 Think your AI agent can do basic medical research end-to-end? Submit to our leaderboard (https://t.co/tXO9ioUPiN) and find out!
We're launching #AutoMedBench, the first benchmark for evaluating Medical #AutoResearch agents across the entire research workflow—not just the final answer.
📄 Paper: https://t.co/X2G66j2Qx6
🌐 Project: https://t.co/bYKEVcfqEU
💻 Code: https://t.co/z37BfBBNm5
📝 Plan → ⚙️ Setup → 🔍 Validate → 🚀 Inference → 📦 Submit
As AI agents move from answering medical questions to conducting end-to-end medical AI research, we need to measure where they succeed—and where they break.
What we benchmark:
• 24 tasks across segmentation, image enhancement, VQA, report generation, and lesion detection
• 48 task-tier combinations spanning Lite and Standard settings
• 6 frontier AI agents under a unified interface
• Thousands of runs with detailed logs of stage performance, costs, tokens, wall time, and failure modes
📊 Current leaderboard:
🥇 #Opus 4.6: 66.5
🥈 #GLM-5: 61.6
🥉 #Gemini 3.1 Pro: 59.0
4️⃣ #ChatGPT-5.4: 55.3
5️⃣ #MiniMax-M2.5: 51.6
6️⃣ #Qwen3.5: 51.2
🔎 Key findings:
⚠️ Agents are better at completing workflows than producing high-quality scientific outputs.
⚠️ Validation is the weakest stage; Setup is the strongest.
⚠️ More scaffolding is not always better—some frontier agents actually perform worse with additional guidance.
⚠️ The dominant failures are verification and submission, not task understanding.
💡 Takeaway:
The next frontier for research agents isn't just more medical knowledge—it's better workflow control, validation, error recovery, and artifact-level reasoning.
#AIAgents #MedicalAI #AgenticAI #LLM #MultimodalAI #HealthcareAI #Benchmark
🩺 Can AI agents conduct medical research end-to-end, just like human researchers?
Introducing AutoMedBench — the first workflow-aware benchmark for medical AutoResearch agents. 🧪
📄 Paper: https://t.co/2H666zmPov
🌐 Project: https://t.co/bYKEVcfqEU
💻 Code: https://t.co/z37BfBBNm5
Medical agents are rapidly evolving from answering questions to conducting end-to-end medical AI research: loading datasets, building pipelines, debugging failures, running inference, and submitting results.
Yet most benchmarks only evaluate the final answer.
A good score can hide a broken process. A failed run often reveals nothing about where the agent went wrong.
🔬 AutoMedBench evaluates the entire workflow.
Every run is decomposed into 5 stages:
📝 Plan → ⚙️ Setup → 🔍 Validate → 🚀 Inference → 📦 Submit
Instead of a single score, AutoMedBench diagnoses which stage succeeds or fails—and why.
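As a rough illustration of stage-level diagnosis, the sketch below records a pass/fail outcome per stage and reports the earliest stage that failed. Only the five stage names come from the post; the `StageResult` type, the `first_failure` helper, and the example run are hypothetical and are not taken from the AutoMedBench code.

```python
from dataclasses import dataclass
from typing import Optional

# Stage names as listed in the post; everything else is a hypothetical sketch.
STAGES = ["Plan", "Setup", "Validate", "Inference", "Submit"]


@dataclass
class StageResult:
    stage: str
    passed: bool
    note: str = ""


def first_failure(results: list[StageResult]) -> Optional[StageResult]:
    """Return the earliest failed stage in workflow order, or None if all passed."""
    by_stage = {r.stage: r for r in results}
    for stage in STAGES:
        result = by_stage.get(stage)
        if result is None:
            # A stage with no record is treated as never reached.
            return StageResult(stage, False, "stage never reached")
        if not result.passed:
            return result
    return None


# Made-up run that breaks at validation.
run = [
    StageResult("Plan", True),
    StageResult("Setup", True),
    StageResult("Validate", False, "sanity check on sample outputs skipped"),
]

failure = first_failure(run)
print(failure.stage if failure else "all stages passed")
```

Reporting the first failing stage, rather than one aggregate number, is what lets a failed run say where the agent went wrong.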
📊 Benchmark scope
• 5 medical AI tracks: Segmentation, Enhancement, VQA, Report Generation, and Lesion Detection
• Lite & Standard tiers (same data/metrics, different scaffolding)
• Long-horizon tasks averaging ~33 agent turns
• Full logs of actions, tokens, runtime, cost, and error codes
We put today's frontier agents to the test:
🏆 Overall leaderboard
🥇 #Opus 4.6 — 66.5
🥈 #GLM-5 — 61.6
🥉 #Gemini 3.1 Pro — 59.0
4️⃣ #ChatGPT-5.4 — 55.3
5️⃣ #MiniMax-M2.5 — 51.6
6️⃣ #Qwen3.5 — 51.2
But no single model dominates everything: GLM-5 leads VQA, while Opus 4.6 leads most other tracks.
⚠️ The key finding
Across thousands of runs, Validate is the weakest stage while Setup is the strongest.
Today's agents are much better at making a pipeline run than ensuring it is correct before large-scale inference.
📉 The bottleneck isn't medical knowledge
🔍 Verification & recovery errors: 37.7%
📦 Deliverable & submission errors: 38.1%
🧠 Task-understanding errors: only 0.9%
Even a single triggered error code reduces the overall score by 48%.
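A breakdown like the one above can be obtained by tallying logged error categories and converting the counts to percentages. The sketch below assumes each logged error carries a category label; the `category_shares` helper, the label names, and the sample log are hypothetical and do not reproduce the benchmark's data or error codes. The three shares quoted in the post sum to 76.7%, so the rest presumably falls into categories not listed here.

```python
from collections import Counter


def category_shares(error_categories: list[str]) -> dict[str, float]:
    """Map each category to its percentage of all logged errors."""
    counts = Counter(error_categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}


# Made-up log of ten errors (hypothetical labels, not real benchmark data).
logged = (
    ["verification"] * 3
    + ["submission"] * 4
    + ["task_understanding"] * 1
    + ["other"] * 2
)

shares = category_shares(logged)
for category, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {pct}%")
```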
The next frontier for medical AI agents is not more knowledge—it's workflow reliability, verification, and self-correction.
#AutoResearch #MedicalAI #AIAgents #HealthcareAI #AgenticAI
