Robots Digest 🤖 @robotsdigest - Twitter Profile

Robots Digest 🤖

@robotsdigest

about 13 hours ago

paper: https://t.co/idoMTZC3ef website: https://t.co/VI2KTVwH4S

0

3

1

0

114

Robots Digest 🤖

@robotsdigest

about 13 hours ago

PointAction tackles a key limitation of Video-Action Models: RGB video predictions are not directly actionable because 3D motion, geometry, and contact constraints remain implicit. The paper introduces dynamic 3D pointmaps as a universal action interface, jointly predicting future RGB frames and metric XYZ pointmaps before decoding them into robot actions. By separating a universal video-to-point model from a lightweight embodiment-specific point-to-action decoder, PointAction can scale 4D world modeling across tasks and robots while requiring far less action supervision. The same pretrained model transfers across different robot embodiments with only small decoder adaptations. Results are impressive. PointAction achieves state-of-the-art 4D generation quality, outperforms strong VLA and VAM baselines on RoboCasa, and successfully transfers to real robot arms unseen during pretraining. The key takeaway is that explicit 3D point dynamics may be a more scalable interface between generative world models and robot control than RGB-only video rollouts.

1

41

8

33

2K

Robots Digest 🤖

@robotsdigest

about 13 hours ago

A useful way to view PointAction is as a factorization of robot intelligence into two stages. First, learn a large-scale generative model that predicts how the 3D world evolves. Second, learn a small robot-specific decoder that translates those predicted 3D dynamics into control commands. This design shifts the expensive scaling problem from robot actions to geometric world modeling. Large video datasets can supervise RGB and 3D point dynamics, while only limited action data is needed for each new embodiment. The result is a system that achieves strong 4D generation, better manipulation performance than recent VLA and VAM baselines, and successful deployment on robot arms never seen during pretraining. It is a compelling step toward embodiment-agnostic robot foundation models.
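A toy sketch of that two-stage split, for intuition only. Everything here is hypothetical and not from the paper: `predict_pointmaps` stands in for the shared video-to-point model (it just drifts every point 1 cm along +x per step), and `make_decoder` builds a per-robot decoder that turns centroid displacement into a scaled delta command. The names `arm_a` and `arm_b` and their gains are made up.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float, float]  # metric XYZ
Pointmap = List[Point]              # one 3D point per tracked pixel


def predict_pointmaps(current: Pointmap, steps: int) -> List[Pointmap]:
    """Toy stand-in for the shared, embodiment-agnostic world model.

    Assumption: every point drifts 1 cm along +x per step. A real model
    would be a learned generative network, not a fixed rule.
    """
    return [[(x + 0.01 * t, y, z) for (x, y, z) in current]
            for t in range(1, steps + 1)]


def centroid(pointmap: Pointmap) -> Point:
    """Mean XYZ position of a pointmap."""
    n = len(pointmap)
    return (sum(p[0] for p in pointmap) / n,
            sum(p[1] for p in pointmap) / n,
            sum(p[2] for p in pointmap) / n)


def make_decoder(gain: float) -> Callable[[Pointmap, Pointmap], Point]:
    """Build a toy embodiment-specific decoder: scaled centroid displacement."""
    def decode(prev: Pointmap, nxt: Pointmap) -> Point:
        a, b = centroid(prev), centroid(nxt)
        return tuple(gain * (bi - ai) for ai, bi in zip(a, b))
    return decode


# Only this small table changes per robot; the world model is shared.
DECODERS: Dict[str, Callable[[Pointmap, Pointmap], Point]] = {
    "arm_a": make_decoder(1.0),
    "arm_b": make_decoder(0.5),
}


def rollout(current: Pointmap, steps: int, embodiment: str) -> List[Point]:
    """Predict future pointmaps once, then decode them for one embodiment."""
    decode = DECODERS[embodiment]
    frames = [current] + predict_pointmaps(current, steps)
    return [decode(frames[i], frames[i + 1]) for i in range(steps)]
```

Calling `rollout(points, 2, "arm_a")` and `rollout(points, 2, "arm_b")` reuses the same predicted 3D dynamics and swaps only the decoder, which is the part the thread says needs action data.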

1

5

0

272

Robots Digest 🤖

@robotsdigest

about 15 hours ago

paper: https://t.co/rjZIhLjGN0 website: https://t.co/YfBxAKv79E codebase: https://t.co/6VcMtVONuv

0

3

0

124

Robots Digest 🤖

@robotsdigest

about 15 hours ago

Humanoid-GPT treats humanoid control like foundation model training. The team scaled from curated motion datasets to a 2B-frame corpus, clustered motions with a Harmonic Motion Embedding, trained hundreds of RL motion experts, then distilled them into a single Transformer tracker. The result is a zero-shot humanoid controller that tracks diverse human motions without finetuning.

1

42

5

22

2K

Robots Digest 🤖

@robotsdigest

about 15 hours ago

The real-world results are the strongest part. Humanoid-GPT transfers directly from simulation to a Unitree G1, tracking unseen dances, teleoperation commands, athletic movements, and recovery behaviors in real time. With TensorRT and ONNX optimization, the full system runs at under 1.5 ms latency on a single RTX 4090.

1

4

1

237

Robots Digest 🤖

@robotsdigest

1 day ago

paper: https://t.co/kfRwWW3FYo website: https://t.co/6Oj59mIMn4

0

1

0

81

Robots Digest 🤖

@robotsdigest

1 day ago

DynaFLIP argues that robot generalization is fundamentally a perception problem. Instead of training vision encoders to recognize what exists in a scene, it trains them to understand how the scene changes under action. The key idea is aligning image transitions, language instructions, and 3D flow into a shared representation, producing features that focus on control-relevant regions rather than visually salient distractions. Across simulation and real-world benchmarks, these dynamics-aware representations consistently outperform CLIP, DINOv2, SigLIP, R3M, and VC-1, showing that better perception directly translates into stronger manipulation performance.
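The post does not say which objective DynaFLIP uses. As an assumption for illustration, one common way to pull several modalities into a shared space is a pairwise contrastive (InfoNCE) loss. The sketch below applies it to embeddings of image transitions, instructions, and 3D flow. The function names, the temperature, and the equal weighting of the three pairs are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; assumes both vectors are nonzero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchors: List[Vector], targets: List[Vector],
             temperature: float = 0.1) -> float:
    """Mean cross-entropy of matching anchors[i] to targets[i] vs. all targets."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, t) / temperature for t in targets]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def tri_modal_loss(transitions: List[Vector], instructions: List[Vector],
                   flows: List[Vector], temperature: float = 0.1) -> float:
    """Average symmetric InfoNCE over the three modality pairs (assumed weighting)."""
    pairs = [(transitions, instructions),
             (transitions, flows),
             (instructions, flows)]
    total = sum(info_nce(a, b, temperature) + info_nce(b, a, temperature)
                for a, b in pairs)
    return total / (2 * len(pairs))
```

The loss is small when the i-th transition, instruction, and flow embeddings point the same way and large when they are mismatched, which is the sense in which the three signals get "aligned".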

1

16

3

7

629

Robots Digest 🤖

@robotsdigest

1 day ago

What stands out is the transferability. A single DynaFLIP encoder plugs into lightweight MLP policies, Diffusion Policies, and VLAs, consistently improving performance in both in-distribution and out-of-distribution settings. The largest gains appear under visual, spatial, and semantic distribution shifts, where models trained on static visual objectives often fail. By teaching perception to encode action-induced change rather than appearance alone, DynaFLIP provides evidence that dynamics-aware visual representations may be a key ingredient for more robust and generalizable robot learning.

1

4

0

1

215

Robots Digest 🤖

@robotsdigest

1 day ago

paper: https://t.co/hKMP39K7Lr website: https://t.co/pTWxGKVK8H

0

2

0

94

Robots Digest 🤖

@robotsdigest

1 day ago

RoboDream introduces a simple but powerful idea for robot data scaling: actions, objects, and scenes are separate, recombinable components. Instead of generating everything jointly, the model conditions on a robot-only trajectory, a scene prior, and an object prior, allowing demonstrations to be synthesized with novel objects, scenes, viewpoints, and task contexts while preserving physically valid robot motion. The key outcome is a scalable robot data engine. Existing demonstrations can be retrieved and "reborn" in new environments, and operators can perform prop-free teleoperation by acting without physical objects while the model later generates realistic interactions. Across real-world manipulation tasks, generated data consistently improves policy performance and significantly reduces the amount of costly real-world collection required.

2

38

6

22

1K

Robots Digest 🤖

@robotsdigest

1 day ago

RoboDream points toward a future where robot data collection looks much more like content creation than traditional teleoperation. By anchoring generation to valid robot trajectories and treating scenes and objects as interchangeable priors, the system can synthesize photorealistic demonstrations in zero-shot environments without task-specific fine-tuning. The broader implication is that scaling robot learning may not require scaling human teleoperation at the same rate. If trajectories, objects, scenes, and viewpoints can be recombined freely, a relatively small set of demonstrations can be transformed into a much larger and more diverse training corpus for manipulation policies.
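To make the scaling argument concrete, here is a minimal sketch of the combinatorics. It assumes every trajectory can be paired with every scene, object, and viewpoint. That is an upper bound, not a claim from the paper, since some combinations would not be physically sensible. `recombine` and the example names are hypothetical.

```python
from itertools import product
from typing import Dict, List


def recombine(trajectories: List[str], scenes: List[str],
              objects: List[str], viewpoints: List[str]) -> List[Dict[str, str]]:
    """Enumerate every (trajectory, scene, object, viewpoint) generation request."""
    return [
        {"trajectory": t, "scene": s, "object": o, "viewpoint": v}
        for t, s, o, v in product(trajectories, scenes, objects, viewpoints)
    ]


# Three recorded trajectories fan out into 3 * 2 * 2 * 2 = 24 requests.
requests = recombine(
    trajectories=["pick", "place", "pour"],
    scenes=["kitchen", "workbench"],
    objects=["mug", "bottle"],
    viewpoints=["front", "wrist"],
)
```

The count grows multiplicatively with each prior, while the number of human demonstrations stays fixed at the length of `trajectories`.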

1

2

0

1

371

Robots Digest 🤖

@robotsdigest

3 days ago

read more: https://t.co/mplZd8HKQD

0

2

0

223

Robots Digest 🤖

@robotsdigest

3 days ago

Mecka AI just raised $60M to tackle one of the biggest bottlenecks in robotics: data. Instead of building humanoids, they’re building the infrastructure layer that teaches them how humans actually interact with the world. Their approach is interesting: collect human activity data through body sensors, iPhones, custom rigs, and egocentric recordings, then turn that messy real-world behavior into training data for robot models. Think Scale AI, but for physical intelligence. As robotics shifts from hardware-constrained to data-constrained, companies like Mecka are betting that the winners won’t just be the labs building robots, but also the ones building the data engine behind them.

3

30

6

9

2K

Robots Digest 🤖

@robotsdigest

3 days ago

blog: https://t.co/woxtXcWIWC code: https://t.co/8f2oO7Pkwb paper: https://t.co/CAXQSUxNhG

0

2

0

4

404

Robots Digest 🤖

@robotsdigest

3 days ago

NVIDIA just dropped Cosmos 3, a fully open world model for physical AI. Unlike traditional robot models that only predict actions, Cosmos 3 combines vision reasoning, world simulation, and action generation in one system. It can understand and generate text, images, video, sound, and robot actions while serving as a foundation for robotics, autonomous vehicles, and vision agents. Physical AI is moving from perception to imagination.

2

19

5

4

771

Robots Digest 🤖

@robotsdigest

3 days ago

The bigger story is the shift toward world models. NVIDIA claims Cosmos 3 can act as a VLM, a simulator, and the backbone for robot policies. Combined with the new Cosmos Coalition and open model releases, the goal is to make synthetic data generation, policy training, and evaluation dramatically faster. The race is no longer just about language models. It's about models that can predict and interact with the physical world.

2

1

0

466
