🚀 Multimodal agents are improving fast, but real-world deployment is still a nightmare: video uploads are extremely expensive, frames are redundant, and prompts bloat quickly.
Enter VisualClaw 🦞👓 — a new framework that fixes the system design bottleneck so you don't have to retrain your VLM!
💡 How it works (See, Streamline, Meta-Evolve):
• Edge cascade filtering: Keeps only salient frames on-device, sending far less video to the VLM. A 1-hour 1fps stream is 3,600 frames—we filter the noise first.
• Hot/cold skills: Dynamically manages prompt bloat.
• Memory-guided evolution: The scaffold learns from experience. Correct examples enter memory and failures trigger an evolver to build new skills.
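To make the edge cascade idea concrete, here is a minimal sketch, not the VisualClaw implementation. It assumes frames are flattened grayscale vectors and that salience can be approximated by frame differencing. The function names and both thresholds are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]  # assumed: a flattened grayscale frame, values in [0, 1]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cascade_filter(frames: List[Frame],
                   cheap_thresh: float = 0.05,
                   strict_thresh: float = 0.15) -> List[int]:
    """Return indices of frames kept for upload (hypothetical two-stage cascade).

    Stage 1 (cheap): drop frames nearly identical to the previous frame.
    Stage 2 (stricter): keep only frames that differ enough from the
    last frame that was actually kept.
    """
    kept: List[int] = []
    for i, frame in enumerate(frames):
        if i == 0:
            kept.append(i)  # always keep the first frame as a reference
            continue
        # Stage 1: compare with the immediately preceding frame.
        if mean_abs_diff(frame, frames[i - 1]) < cheap_thresh:
            continue
        # Stage 2: compare with the last kept frame.
        if mean_abs_diff(frame, frames[kept[-1]]) >= strict_thresh:
            kept.append(i)
    return kept


# Toy stream: 3,600 "frames" (1 hour at 1 fps), with a scene change every 600 frames.
stream = [[float((i // 600) % 2)] * 4 for i in range(3600)]
kept_indices = cascade_filter(stream)
print(len(stream), "frames in,", len(kept_indices), "frames kept")
```

On this toy stream only the frames at scene changes survive. A real pipeline would likely use a learned salience score in the later stage.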
📊 The Results:
• Massive cost drops: -98.1% API cost vs full-frame upload (and up to -99.3% on Video-MME long!).
• Performance bumps: +15.8% peak accuracy on EgoSchema.
• Agentic gains: +3.2 macro accuracy with a Claude Code backend.
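For a rough sense of scale, the sketch below converts the reported cost reductions into frame counts for the 3,600-frame example. It assumes API cost scales linearly with the number of uploaded frames, which the post does not state. The helper name is hypothetical.

```python
def frames_within_budget(total_frames: int, cost_reduction: float) -> int:
    """Frames that fit in the reduced budget, ASSUMING cost is linear in frames."""
    return round(total_frames * (1.0 - cost_reduction))


one_hour_at_1fps = 3600  # 60 min * 60 s * 1 frame per second

typical = frames_within_budget(one_hour_at_1fps, 0.981)    # -98.1% API cost
best_case = frames_within_budget(one_hour_at_1fps, 0.993)  # -99.3% (Video-MME long)
print(typical, best_case)
```

Under that linearity assumption, the reported savings would correspond to uploading a few dozen frames out of 3,600.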
🏟️ We are also releasing VisualClawArena:
A rigorous 5-stage multimodal benchmark where agents must navigate video clips, documents, user files, dynamic updates, and executable checks (averaging 24.4 rounds per scenario).
Less video. Lower cost. Better adaptation. ⚡
Read the paper & grab the code:
Project Page: https://t.co/XSXRPxiLov
Arxiv: https://t.co/M7DlShge4l
Code: https://t.co/c1K0aY7I5h
It's hard to believe it's been over a decade since #UNet was introduced in 2015. I still remember reading that paper during the first year of my PhD and realizing how groundbreaking it was.
Now, more than a decade later, we're excited to look back and look ahead.
🔬 A decade of biomedical image segmentation, in one map (2015–2025).
https://t.co/5Mf6aY6cy6
From task-specific U-Nets to universal promptable foundation models, we summarize the key ideas, milestones, and emerging directions shaping the next generation of biomedical AI.
10 years of breakthroughs, distilled into a single timeline:
🩺 Task-specific U-Nets
➡️ Self-supervised learning
➡️ Multimodal foundation models
➡️ Universal promptable models
The field has evolved from "one model per task" to "one model for many tasks."
🏆 Think your AI agent can do basic medical research end-to-end? Submit to our leaderboard (https://t.co/tXO9ioUPiN) and find out!
We're launching #AutoMedBench, the first benchmark for evaluating Medical #AutoResearch agents across the entire research workflow—not just the final answer.
📄 Paper: https://t.co/X2G66j2Qx6
🌐 Project: https://t.co/bYKEVcfqEU
💻 Code: https://t.co/z37BfBBNm5
📝 Plan → ⚙️ Setup → 🔍 Validate → 🚀 Inference → 📦 Submit
As AI agents move from answering medical questions to conducting end-to-end medical AI research, we need to measure where they succeed—and where they break.
What we benchmark:
• 24 tasks across segmentation, image enhancement, VQA, report generation, and lesion detection
• 48 task-tier combinations spanning Lite and Standard settings
• 6 frontier AI agents under a unified interface
• Thousands of runs with detailed logs of stage performance, costs, tokens, wall time, and failure modes
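The 48 task-tier combinations follow from crossing the 24 tasks with the two tiers. A minimal sketch, with placeholder task IDs (the real task names are not listed here):

```python
from itertools import product

TIERS = ("Lite", "Standard")

# Hypothetical placeholder IDs; the benchmark's actual task names may differ.
task_ids = [f"task_{i:02d}" for i in range(1, 25)]

# Every task is evaluated under every tier.
combos = list(product(task_ids, TIERS))
print(len(task_ids), "tasks x", len(TIERS), "tiers =", len(combos), "combinations")
```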
📊 Current leaderboard:
🥇 #Opus 4.6: 66.5
🥈 #GLM-5: 61.6
🥉 #Gemini 3.1 Pro: 59.0
4️⃣ #ChatGPT-5.4: 55.3
5️⃣ #MiniMax-M2.5: 51.6
6️⃣ #Qwen3.5: 51.2
🔎 Key findings:
⚠️ Agents are better at completing workflows than producing high-quality scientific outputs.
⚠️ Validation is the weakest stage; Setup is the strongest.
⚠️ More scaffolding is not always better—some frontier agents actually perform worse with additional guidance.
⚠️ The dominant failures are verification and submission, not task understanding.
💡 Takeaway:
The next frontier for research agents isn't just more medical knowledge—it's better workflow control, validation, error recovery, and artifact-level reasoning.
#AIAgents #MedicalAI #AgenticAI #LLM #MultimodalAI #HealthcareAI #Benchmark
🩺 Can AI agents conduct medical research end-to-end, just like human researchers?
Introducing AutoMedBench — the first workflow-aware benchmark for medical AutoResearch agents. 🧪
📄 Paper: https://t.co/2H666zmPov
🌐 Project: https://t.co/bYKEVcfqEU
💻 Code: https://t.co/z37BfBBNm5
Medical agents are rapidly evolving from answering questions to conducting end-to-end medical-AI research: loading datasets, building pipelines, debugging failures, running inference, and submitting results.
Yet most benchmarks only evaluate the final answer.
A good score can hide a broken process. A failed run often reveals nothing about where the agent went wrong.
🔬 AutoMedBench evaluates the entire workflow.
Every run is decomposed into 5 stages:
📝 Plan → ⚙️ Setup → 🔍 Validate → 🚀 Inference → 📦 Submit
Instead of a single score, AutoMedBench diagnoses which stage succeeds or fails—and why.
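A stage-wise report might look like the sketch below. It is illustrative only: the class, the per-stage scores in [0, 1], the 0.5 threshold, and the error-code string are all assumptions, not AutoMedBench's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

STAGES = ("Plan", "Setup", "Validate", "Inference", "Submit")


@dataclass
class RunReport:
    """Hypothetical per-run record: one score per stage plus fired error codes."""
    stage_scores: Dict[str, float]
    error_codes: List[str] = field(default_factory=list)

    def weakest_stage(self) -> str:
        """Stage with the lowest score."""
        return min(STAGES, key=lambda s: self.stage_scores[s])

    def first_failure(self, threshold: float = 0.5) -> Optional[str]:
        """Earliest stage (in workflow order) scoring below the threshold."""
        for stage in STAGES:
            if self.stage_scores[stage] < threshold:
                return stage
        return None


report = RunReport(
    stage_scores={"Plan": 0.9, "Setup": 0.95, "Validate": 0.4,
                  "Inference": 0.7, "Submit": 0.6},
    error_codes=["E_VALIDATE_SKIPPED"],  # made-up code for illustration
)
print(report.weakest_stage(), report.first_failure())
```

Keeping the stages in workflow order lets the report point at where a run first went wrong, not only at its lowest score.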
📊 Benchmark scope
• 5 medical AI tracks: Segmentation, Enhancement, VQA, Report Generation, and Lesion Detection
• Lite & Standard tiers (same data/metrics, different scaffolding)
• Long-horizon tasks averaging ~33 agent turns
• Full logs of actions, tokens, runtime, cost, and error codes
We put today's frontier agents to the test:
🏆 Overall leaderboard
🥇 #Opus 4.6 — 66.5
🥈 #GLM-5 — 61.6
🥉 #Gemini 3.1 Pro — 59.0
4️⃣ #ChatGPT-5.4 — 55.3
5️⃣ #MiniMax-M2.5 — 51.6
6️⃣ #Qwen3.5 — 51.2
But no single model dominates everything: GLM-5 leads VQA, while Opus 4.6 leads most other tracks.
⚠️ The key finding
Across thousands of runs, Validate is the weakest stage while Setup is the strongest.
Today's agents are much better at making a pipeline run than ensuring it is correct before large-scale inference.
📉 The bottleneck isn't medical knowledge
🔍 Verification & recovery errors: 37.7%
📦 Deliverable & submission errors: 38.1%
🧠 Task-understanding errors: only 0.9%
Even a single fired error code reduces the overall score by 48%.
The next frontier for medical AI agents is not more knowledge—it's workflow reliability, verification, and self-correction.
#AutoResearch #MedicalAI #AIAgents #HealthcareAI #AgenticAI
Your medical AI agent didn't fail because it lacked medical knowledge. It failed because it didn't verify its own work.
A strong agent produces fewer errors and recovers gracefully from the ones it makes.
📖 Our paper, AutoMedBench, is now online: https://t.co/mJSZFLuMHq
🌍 Leaderboard: https://t.co/Xm88BI4eJg
AutoMedBench is a long-horizon medical imaging + multimodal benchmark with 5 tracks, averaging 33 agent turns per run. Tasks come in Lite and Standard tiers and are scored across 5 stages: Plan → Setup → Validate → Inference → Submit.
Main finding: Validate is the weakest stage, while Setup is the strongest. Current agents are better at making pipelines executable than at verifying reliability. ⚠️
Error analysis confirms it:
🔍 verification/recovery errors: 37.7%
📦 deliverable/submission errors: 38.1%
🧠 task-understanding errors: only 0.9%
Runs with one fired error code have a 48% lower overall score than clean runs.
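A quick worked check of these numbers. The three shares are from the post; the clean-run score of 60.0 is a made-up example value, used only to show what a 48% relative drop means.

```python
# Error shares reported above, in percent of all errors.
shares = {
    "verification/recovery": 37.7,
    "deliverable/submission": 38.1,
    "task-understanding": 0.9,
}

named_total = round(sum(shares.values()), 1)
other = round(100.0 - named_total, 1)  # categories not listed in the post
print(f"named categories: {named_total}%, everything else: {other}%")

# A 48% lower score is a relative drop, not 48 points.
clean_score = 60.0  # hypothetical clean-run score
one_error_score = round(clean_score * (1 - 0.48), 1)
print(f"clean: {clean_score}, with one fired error code: {one_error_score}")
```

The two engineering-style categories together account for about three quarters of all errors, while task understanding stays under 1%.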
Can medical AI research be automated with AI itself?
This new benchmark from NVIDIA and UC Santa Cruz aims to evaluate this:
AutoMedBench: Towards Medical AutoResearch with Agentic AI Models
"we present AutoMedBench, a workflow-aware benchmark for evaluating autonomous agents on end-to-end medical-AI research tasks"
The benchmark covers 24 tasks spanning segmentation, question answering, report generation, and more, across modalities such as CT, X-ray, and pathology.
The paper experiments with six frontier models (Opus 4.6, GLM-5, Gemini 3.1 Pro, GPT-5.4, MiniMax-M2.5, Qwen3.5-397B) and finds that they remain far from reliable medical AI researchers. While agents can often set up runnable pipelines, validation is consistently the weakest stage, and engineering failures dominate over understanding errors.
Definitely curious to see how this performs with the newest generation of models/agents!
Post-training in diffusion models is a very under-appreciated topic.
So, we're delighted to try to change that at ECCV'26. Announcing a dedicated tutorial for it w/ best pack 🔥
We'll cover several tracks. Check out the link below to learn more!
@linoy_tsaban @hila_chefer
#Claude is great — but building clinical-grade AI requires more active evidence retrieval.
Introducing #ClinSeekAgent — a complete stack for building advanced medical agents through active multimodal evidence retrieval, powered by a comprehensive agent toolbox for dynamic clinical reasoning.
🧰 **Agent Toolbox**: 20 MCP tools for active evidence seeking (11 EHR · 3 web retrieval · 6 medical imaging tools)
🔧 **Framework**: ClinSeekAgent for orchestrating
📊 **Benchmark**: ClinSeek-Bench — paired Curated vs Agentic evaluation (text-only EHR + multimodal)
🧠 **Data**: high-quality Claude Opus 4.6 evidence-seeking trajectories (for SFT)
🤖 **Model**: ClinSeek-35B-A3B — open-source SOTA clinical agent
📄 https://t.co/T1Qn64LkUv
💻 https://t.co/Jai6zHfhgH
🤗 https://t.co/RJaES2hgON
Clinical AI shouldn't just consume evidence handed to it — it should actively seek evidence, e.g., linking multimodal data, analyzing patient context, and retrieving external knowledge to support clinical reasoning 🔎
Introducing ClinSeekAgent, our automated agentic framework for active multimodal evidence seeking in clinical reasoning, achieving +15.1 F1 compared to #Claude Opus 4.6.
Paper: https://t.co/YtnI5FnArd
Code: https://t.co/Jai6zHfhgH
Day 5 building MetaClaw 🦞
Mad Max Mode: your agent now learns 24/7
⚡ Chat → instant skill updates
🧠 Away → RL keeps training
Try it 👉 https://t.co/p5yqdWmwzR
While Google's Veo has mastered visual realism, capturing the causal logic of the physical world—like the state transition from 'whole' to 'sliced'—remains a major challenge. 🍅🔪
Excited to share our latest work, CAST, which helps Veo generate more coherent storylines! It acts as a lightweight, plug-and-play adapter that tracks visual history to enforce true state consistency. 🧵👇
@sheriscientist @CVPR @aliathar94 @_vztu I agree. How much time is needed to verify the few remaining papers? Why not release decisions directly for the papers that are already done?