When every generalist robot model scores 95%+ on a benchmark, the numbers become meaningless.
What if we built a photorealistic benchmark that never saturates and can generate new scenes and tasks with AI Workflows in minutes?
We introduce RoboLab! 🧵(1/6)
@chris_j_paxton We built RoboLab to answer the question "How should we *objectively* evaluate *real-world* generalist policies at scale?" That question is exactly where we differentiate.
The noise, though, is a problem of statistics and of quantifying the sim2real gap. Unfortunately, that remains unsolved.
Cosmos3 (post-trained on DROID) surpassed strong VLA & WAM baselines to rank #1 on RoboLab.
All the compute invested in the massive Cosmos3 pre-training and mid-training contributes to unlocking a better robot foundation model. 😄
🎉 We added 2 SOTA WAMs to the RoboLab Leaderboard 🎉
Current leaders on RoboLab-120 (specific instr.):
🥇Cosmos3-Nano-Policy (39.7%)
🥈π0.5 (28.1%)
🥉DreamZero (28.1%)
→ See full results at: https://t.co/Le8jykn5jo
→ All policy clients available at: https://t.co/wQH4Py6zJ8
Generalist robot policies need a benchmark that works across any robot and any policy. 🦾
Introducing RoboLab, a high‑fidelity simulation benchmark built on NVIDIA Isaac and Omniverse to evaluate generalist robot policies in diverse, photoreal, physics‑based environments.
Coming soon to the NVIDIA Isaac Lab‑Arena roadmap for large‑scale robotic policy evaluation. 📖 https://t.co/wW472SHXPz
#NationalRoboticsWeek
RoboLab comes with RoboLab-120 — a curated, diverse benchmark of 120 tasks to get started.
Set up and run in <20 min. (6/6)
Try it out 👇
🌐 https://t.co/pNMITqaCus
📄 https://t.co/CDS0tpFnZ0
💻 https://t.co/bnJmhPMXa5
→ Customization: comes with 200+ objects, 100+ backgrounds, lighting, camera poses… Don’t like it? No problem, add your own.
→ Diagnostics: motion quality, failure events, + sensitivity analysis for failure attribution
(5/6)