Today, RLWRLD unveils RLDX-1 — our proprietary Robotics Foundation Model.
Across all 8 public benchmarks, RLDX-1 outperforms leading SOTA models including #NVIDIA#GR00T and Physical Intelligence #π0 — delivering state-of-the-art performance among open robotics foundation models.
🎯 A 'Dexterity-First' Philosophy
The industry assumes dexterity will follow once intelligence is solved. We see it the other way around.
Dexterity isn't downstream of intelligence — it's the path intelligence must take to act in the physical world. Real industrial work with five-finger robotic hands depends on signals vision alone can't capture: force (torque), tactile feedback, and the precise moment of contact.
🧠 MSAT — Multi-Stream Action Transformer
Where conventional VLAs collapse every input into a single transformer stream, MSAT gives each modality — vision, language, action, touch, memory — its own dedicated stream, then unifies them through joint attention. Force, tactile signals, and long-term memory are handled by purpose-built Physics and Memory modules.
The result: one model that can see, feel, remember, and adapt.
📊 Performance Highlights
RoboCasa Kitchen — 70.6: the first VLA model to cross the 70-point threshold
GR-1 Tabletop — 58.7: +10.7 percentage points over NVIDIA GR00T N1.6
LIBERO-Plus — 86.7%: top score across 7 robustness variables
Pot-to-Cup Pouring on WIRobotics ALLEX — 70.8%: nearly 2× the comparison models, which remained in the high-30% range.
We're also releasing DexBench — our industry-grounded benchmark for dexterous manipulation, defined across five domains: Grasp Diversity, Spatial Precision, Temporal Precision, Contact Precision, and Context Awareness.
🔓 Open Release
Three checkpoints (8.1B parameters each), live now on GitHub and Hugging Face:
RLDX-1-PT — pre-training
RLDX-1-MT-ALLEX — mid-training for ALLEX
RLDX-1-MT-DROID — mid-training for DROID
⚙️ Built on NVIDIA's Cloud-to-Edge Stack
Training and simulation on Isaac GR00T, Isaac Lab, Isaac Sim, and cuRobo. Compute on NVIDIA H100 and A100 GPUs. Edge inference on Jetson AGX Thor with TensorRT. Our collaborations with NVIDIA, AWS, and Microsoft continue across both research and deployment.
��� What's Next: The 4D+ World Model
Video-based world models will never surface what isn't in the pixels — contact torque, tactile signals, robot state. Our 4D+ World Model integrates these directly with vision, language, and action across the temporal dimension, predicting and generating the full physical world. RLDX-1 is the first milestone on that roadmap.
📍 Join us at Dexterity Night in San Francisco on May 13 — followed by launch events in Japan and Korea.
🔗 Explore RLDX-1 on GitHub and Hugging Face.
https://t.co/kT6aX3qo8P
#RLWRL #RLDX1 #PhysicalAI #RoboticsFoundationModel #VLA #Humanoid #Dexterity #FoundationModel #Robotics #AI
Three identical boxes. A mouse is placed into one. A moment later, a go signal, and the robot has to pick.
Without memory, the policy forgets which box.
RLDX-1's Memory Module is built on HAMLET (#ICLR2026 in Rio 🇧🇷), integrated into the full architecture. Open-sourced in two weeks. Stay tuned. 👉 https://t.co/u1Qe48LSmn
#RLDX #RLWRLD #PhysicalAI #Robotics #Dexterity #FoundationModels #Automation #Manufacturing #VLA
VLAs (from VLMs) ❌ => WAMs (from Video Models) ✅
Why WAMs?
1️⃣ World Physics: VLMs know the internet, but Video Models implicitly model the physical laws essential for manipulation.
2️⃣ The "GPT Direction": VLAs are like BERT (rely heavily on task-specific post-training). WAMs are like GPT (pre-train & prompt), unlocking incredible zero-shot transfer!
What I want to see in 2026:
📈 Scaling Laws: We will see much clearer scaling laws for robotics compared to VLAs.
🤝 Human-to-Robot Transfer: Unlocking massive transfer capabilities using video as a shared representation space.
🤖 Zero-Shot Mastery: Moving from short-horizon tasks to long-horizon, dexterous manipulation without task-specific demonstrations.
We recently open-sourced the checkpoints, training and inference code.
Dive into the research! 👇
📄 Paper: https://t.co/jFEwebgyBH
💻 Code: https://t.co/4sZ5RoFmgB
🤗 HF: https://t.co/nPGoLYCPyq
🚀 DreamZero training code is LIVE — train your own WAM (aka VAM)!
🔧 Replicate DROID from-scratch training
📊 Run evals on sim (DROID-Sim, MolmoSpaces, Polaris) & real-world (RoboArena)
No 2 GB200s for real-time inference? No problem — let NVIDIA carry that burden ���. Sign up for our API and jump into prompting new tasks! (e.g. "fan the burger" 🍔, totally unseen verb/task from DROID)
Coming soon: new embodiment/robot fine-tuning initialized from our DreamZero-AGIBot checkpoint. Stay tuned! 🤖
🔗 https://t.co/50wtDaDO8E
🤖🤖Very excited to finally share our new work “Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control”
Everyone in robotics does action-chunking, but why does it actually work?🤔🤔And, what can theory tell us about the properties of data we should be collecting for robotic behavior cloning? 🧵1/N
Introducing VL-JEPA: Vision-Language Joint Embedding Predictive Architecture for streaming, live action recognition, retrieval, VQA, and classification tasks with better performance and higher efficiency than large VLMs.
• VL-JEPA is the first non-generative model that can perform general-domain vision-language tasks in real-time, built on a joint embedding predictive architecture.
• We demonstrate in controlled experiments that VL-JEPA, trained with latent space embedding prediction, outperforms VLMs that rely on data space token prediction.
• We show that VL-JEPA delivers significant efficiency gains over VLMs for online video streaming applications, thanks to its non-autoregressive design and native support for selective decoding.
• We highlight that our VL-JEPA model, with an unified model architecture, can effectively handle a wide range of classification, retrieval, and VQA tasks at the same time.
by @Delong0_0@MustafaShukor1@TheoMoutakanni@willyhcchung Jade Lei Yu Tejaswi Kasarla @AllenBolourchi@ylecun@pascalefung
https://t.co/oUnjCaMKVv
1/ The future of general-purpose robotics will be decided by one major question: which flavor of data scales reasoning? Every major lab represents a different bet.
Over the past 3 months, @adam_patni, @vriishin, and I read the core research papers, spoke with staff at the major labs, and mapped the talent pool. This has completely changed how we think about general-purpose robotics.
Our paper builds intuition, step-by step, across the 2025 frontier: from architectures → evals → data → industry dynamics. Each layer reveals a different bottleneck, but they all converge on one truth—data decides everything.
Our takeaways + process below👇
If you want access to our graph (sound on), comment or DM me
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right.
today, we introduce Representation Autoencoders (RAE).
>> Retire VAEs. Use RAEs. 👇(1/n)
Doing so called AI+robotics
30% time debugging real robot deployment
30% time fixing simulation and looking at tensorboard or wandb
30% time meetings and all kinds of non-research activities
10% time spin my brain to get a bit intellectual contributions with AI
Watch ALLEX in action.
From delicate gestures to precise object handling, our humanoid shows next-level hand dexterity and Physical AI at the @OpenAI Seoul Open Event.
This is how @RLWRLD_ai is redefining real-world robotics 🤖✨
#RLWRLD#OpenAI#PhysicalAI#dexterity #AIrobotics #Seoul
Just saw this awesome demo by @kaysorin — really proud to share ALLEX in action at the OpenAI Seoul Open Event! Watching it move, interact, and demonstrate real-world dexterity was something special. 🤖🙌
Huge shoutout to everyone involved — pushing the boundaries of what’s possible with physical AI.
#RLWRLD #OpenAI #Robotics #PhysicalAI #Dexterity #Innovation #Seoul
Opensourcing a useful tool to calibrate camera extrinsics painlessly in a minute, no checkerboards! It's based on EasyHEC, using differentiable rendering to optimize extrinsics given object meshes+poses. Crazy that even a piece of paper works too.
Code:
https://t.co/CSmD2iIXuK
How to generate billion-scale manipulation demonstrations easily? Let us leverage generative models! 🤖✨
We introduce Dex1B, a framework that generates 1 BILLION diverse dexterous hand demonstrations for both grasping 🖐️and articulation 💻 tasks using a simple C-VAE model.
🚨 Want models to better utilize and ground on the provided knowledge? We introduce Context-INformed Grounding Supervision (CINGS)! Training LLM with CINGS significantly boosts grounding abilities in both text and vision-language models compared to standard instruction tuning.
Q-learning is not yet scalable
https://t.co/hoYUdAAeGZ
I wrote a blog post about my thoughts on scalable RL algorithms.
To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
🚨 New Paper 🧵
How effectively do reasoning models reevaluate their thought? We find that:
- Models excel at identifying unhelpful thoughts but struggle to recover from them
- Smaller models can be more robust
- Self-reevaluation ability is far from true meta-cognitive awareness
Excited to present FastTD3: a simple, fast, and capable off-policy RL algorithm for humanoid control -- with an open-source code to run your own humanoid RL experiments in no time!
Thread below 🧵
We took a short break from robotics to build a human-level agent to play Competitive Pokémon. Partially observed. Stochastic. Long-horizon. Now mastered with Offline RL + Transformers. Our agent, trained on 475k+ human battles,��hits the top 10% on Pokémon Showdown leaderboards. No search or heuristics, just sequence modeling.
Today, we're open-sourcing our Metamon platform with our algorithms, data, and environments:
🌐 https://t.co/4YrQEk2QeX
We are excited to see how our work accelerates research on building generally capable AI agents, and more importantly, inspires the next generation of Pokémon trainers!