Tianhao Qi

18 days ago

@yuyinzhou_cs

It's hard to believe it's been over a decade since #UNet was introduced in 2015. I still remember reading that paper during the first year of my PhD and realizing how groundbreaking it was.

Now, more than a decade later, we're excited to look back and look ahead.

🔬 A decade of biomedical image segmentation, in one map (2015–2025).
https://t.co/5Mf6aY6cy6
From task-specific U-Nets to universal promptable foundation models, we summarize the key ideas, milestones, and emerging directions shaping the next generation of biomedical AI.

10 years of breakthroughs, distilled into a single timeline:
🩺 Task-specific U-Nets
➡️ Self-supervised learning
➡️ Multimodal foundation models
➡️ Universal promptable models

The field has evolved from "one model per task" to "one model for many tasks."

Qi12Tom retweeted

25 days ago

🏆 Think your AI agent can do basic medical research end-to-end? Submit to our leaderboard (https://t.co/tXO9ioUPiN) and find out!

We're launching #AutoMedBench, the first benchmark for evaluating Medical #AutoResearch agents across the entire research workflow, not just the final answer.

📄 Paper: https://t.co/X2G66j2Qx6
🌐 Project: https://t.co/bYKEVcfqEU
💻 Code: https://t.co/z37BfBBNm5

📝 Plan → ⚙️ Setup → 🔍 Validate → 🚀 Inference → 📦 Submit

As AI agents move from answering medical questions to conducting end-to-end medical AI research, we need to measure where they succeed and where they break.

What we benchmark:
• 24 tasks across segmentation, image enhancement, VQA, report generation, and lesion detection
• 48 task-tier combinations spanning Lite and Standard settings
• 6 frontier AI agents under a unified interface
• Thousands of runs with detailed logs of stage performance, costs, tokens, wall time, and failure modes

📊 Current leaderboard:
🥇 #Opus 4.6: 66.5
🥈 #GLM-5: 61.6
🥉 #Gemini 3.1 Pro: 59.0
4️⃣ #ChatGPT-5.4: 55.3
5️⃣ #MiniMax-M2.5: 51.6
6️⃣ #Qwen3.5: 51.2

🔎 Key findings:
⚠️ Agents are better at completing workflows than producing high-quality scientific outputs.
⚠️ Validation is the weakest stage; Setup is the strongest.
⚠️ More scaffolding is not always better: some frontier agents actually perform worse with additional guidance.
⚠️ The dominant failures are verification and submission, not task understanding.

💡 Takeaway: The next frontier for research agents isn't just more medical knowledge; it's better workflow control, validation, error recovery, and artifact-level reasoning.

#AIAgents #MedicalAI #AgenticAI #LLM #MultimodalAI #HealthcareAI #Benchmark

Qi12Tom retweeted

Tanishq Mathew Abraham, Ph.D.

28 days ago

@yuyinzhou_cs

🩺 Can AI agents conduct medical research end-to-end, just like human researchers?

Introducing AutoMedBench, the first workflow-aware benchmark for medical AutoResearch agents. 🧪

📄 Paper: https://t.co/2H666zmPov
🌐 Project: https://t.co/bYKEVcfqEU
💻 Code: https://t.co/z37BfBBNm5

Medical agents are rapidly evolving from answering questions to conducting end-to-end medical-AI research: loading datasets, building pipelines, debugging failures, running inference, and submitting results.
Yet most benchmarks only evaluate the final answer.

A good score can hide a broken process. A failed run often reveals nothing about where the agent went wrong.

🔬 AutoMedBench evaluates the entire workflow.

Every run is decomposed into 5 stages:

📝 Plan → ⚙️ Setup → 🔍 Validate → 🚀 Inference → 📦 Submit

Instead of a single score, AutoMedBench diagnoses which stage succeeds or fails, and why.

📊 Benchmark scope

• 5 medical AI tracks: Segmentation, Enhancement, VQA, Report Generation, and Lesion Detection
• Lite & Standard tiers (same data/metrics, different scaffolding)
• Long-horizon tasks averaging ~33 agent turns
• Full logs of actions, tokens, runtime, cost, and error codes

We put today's frontier agents to the test:

🏆 Overall leaderboard

🥇 #Opus 4.6: 66.5
🥈 #GLM-5: 61.6
🥉 #Gemini 3.1 Pro: 59.0
4️⃣ #ChatGPT-5.4: 55.3
5️⃣ #MiniMax-M2.5: 51.6
6️⃣ #Qwen3.5: 51.2

But no single model dominates everything: GLM-5 leads VQA, while Opus 4.6 leads most other tracks.

⚠️ The key finding

Across thousands of runs, Validate is the weakest stage while Setup is the strongest.

Today's agents are much better at making a pipeline run than ensuring it is correct before large-scale inference.

📉 The bottleneck isn't medical knowledge

🔍 Verification & recovery errors: 37.7%
📦 Deliverable & submission errors: 38.1%
🧠 Task-understanding errors: only 0.9%

Even a single fired error code reduces the overall score by 48%.

The next frontier for medical AI agents is not more knowledge; it's workflow reliability, verification, and self-correction.

#AutoResearch #MedicalAI #AIAgents #HealthcareAI #AgenticAI

Qi12Tom retweeted

KumaKuma

@Kuma_0Kumaaa

about 1 month ago

Your medical AI agent didn't fail because it lacked medical knowledge. It failed because it didn't verify its own work.

A strong agent produces fewer errors and recovers gracefully from the ones it makes.

📖 Our paper, AutoMedBench, is now online: https://t.co/mJSZFLuMHq
🌍 Leaderboard: https://t.co/Xm88BI4eJg

AutoMedBench is a long-horizon medical imaging + multimodal benchmark with 5 tracks, averaging 33 agent turns per run. Tasks come in Lite and Standard tiers and are scored across 5 stages: Plan → Setup → Validate → Inference → Submit.

Main finding: Validate is the weakest stage, while Setup is the strongest. Current agents are better at making pipelines executable than at verifying reliability. ⚠️

Error analysis confirms it:
🔍 verification/recovery errors: 37.7%
📦 deliverable/submission errors: 38.1%
🧠 task-understanding errors: only 0.9%

Runs with one fired error code have a 48% lower overall score than clean runs.

Qi12Tom retweeted

@iScienceLuvr

about 1 month ago

Can medical AI research be automated with AI itself?

This new benchmark from NVIDIA and UC Santa Cruz aims to evaluate this:

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

"we present AutoMedBench, a workflow-aware benchmark for evaluating autonomous agents on end-to-end medical-AI research tasks"

The benchmark covers 24 tasks across segmentation, question answering, report generation, etc., and across modalities like CT, X-ray, pathology, etc.

The paper experiments with six frontier models (Opus 4.6, GLM-5, Gemini 3.1 Pro, GPT-5.4, MiniMax-M2.5, Qwen3.5-397B), and these models remain far from reliable medical AI researchers. While agents can often set up runnable pipelines, validation is consistently the weakest stage, and engineering failures dominate over understanding errors.

Definitely curious to see how this performs with the newest generation of models/agents!

Qi12Tom retweeted

Sayak Paul

@RisingSayak

about 1 month ago

Post-training in diffusion models is a very under-appreciated topic.

So, we're delighted to try to change that at ECCV'26. Announcing a dedicated tutorial for it w/ best pack 🔥

We'll cover several tracks. Check out the link below to learn more!

@linoy_tsaban @hila_chefer https://t.co/ATpYu76CoX

Qi12Tom retweeted

about 1 month ago

#Claude is great, but building clinical-grade AI requires more active evidence retrieval.

Introducing #ClinSeekAgent, a complete stack for building advanced medical agents through active multimodal evidence retrieval, powered by a comprehensive agent toolbox for dynamic clinical reasoning.

🧰 **Agent Toolbox**: 20 MCP tools for active evidence seeking (11 EHR · 3 web retrieval · 6 medical imaging tools)
🔧 **Framework**: ClinSeekAgent for orchestrating
📊 **Benchmark**: ClinSeek-Bench, a paired Curated vs Agentic evaluation (text-only EHR + multimodal)
🧠 **Data**: high-quality Claude Opus 4.6 evidence-seeking trajectories (for SFT)
🤖 **Model**: ClinSeek-35B-A3B, an open-source SOTA clinical agent

📄 https://t.co/T1Qn64LkUv
💻 https://t.co/Jai6zHfhgH
🤗 https://t.co/RJaES2hgON

Qi12Tom retweeted

about 1 month ago

@yuyinzhou_cs

Clinical AI shouldn't just consume evidence handed to it. It should actively seek evidence, e.g., linking multimodal data, analyzing patient context, and retrieving external knowledge to support clinical reasoning 🔎

Introducing ClinSeekAgent, our automated agentic framework for active multimodal evidence seeking in clinical reasoning, achieving +15.1 F1 compared to #claude Opus 4.6.
Paper: https://t.co/YtnI5FnArd
Code: https://t.co/Jai6zHfhgH

Qi12Tom retweeted

Cihang Xie

@cihangxie

4 months ago

Day 5 building MetaClaw 🦞

Mad Max Mode: your agent now learns 24/7
⚡ Chat → instant skill updates
🧠 Away → RL keeps training

Try it 👉 https://t.co/p5yqdWmwzR

Qi12Tom retweeted

Cihang Xie

@cihangxie

4 months ago

While Google's Veo has mastered visual realism, capturing the causal logic of the physical world (like the state transition from 'whole' to 'sliced') remains a major challenge. 🍅🔪

Excited to share our latest work, CAST, that improves Veo to generate more coherent storylines! It acts as a lightweight, plug-and-play adapter that tracks visual history to enforce true state consistency. 🧵👇

6 months ago

Cool dance by @UnitreeRobotics — it reminds me of that summer when my teammates and I were hands-on building an autonomous car together.

over 1 year ago

@CVPR @StarryFX @MiaoweiW @sheriscientist @aliathar94 @_vztu @CVPR Could the committee first release, through OpenReview, the decisions for the papers that are already finalized?

over 1 year ago

@sheriscientist @CVPR @aliathar94 @_vztu @CVPR I agree. How much time is needed to verify the few remaining papers? Why not directly release decisions for the papers that are already done?

over 1 year ago

@CVPR @aliathar94 @_vztu You mean 6 PCs need to verify decisions for 10k+ papers?
