Robots need memory to handle complex, multi-step tasks. Can we design an effective method for this?
We propose MemER, a hierarchical VLA policy that learns what visual frames to remember across multiple long-horizon tasks, enabling memory-aware manipulation.
(1/5)
Pushing online RL to the next level -- exposing RL Token from the π-0.6 model for online RL achieves superhuman performance with as little as 15 minutes of data.
We equipped PI policies with memory!
And taught our robots to do long-horizon real-world tasks such as preparing the items for a recipe, cooking a grilled cheese, and cleaning the kitchen!
Most robot policies today still largely lack memory: they make all their decisions based on what they can see right now. MemER aims to change that by learning which frames are important; this lets it deal with tasks like object search. @ajaysridhar0, @jenpan_,
and @satviks107Sharma tell us about how to achieve this fundamental capability for long-horizon task execution.
Watch Episode #54 of RoboPapers with @micoolcho and @chris_j_paxton to learn more!
Rollouts in the real world are slow and expensive. What if we could roll out trajectories entirely inside a world model (WM)?
Introducing 🚀Ctrl-World🚀, a generative manipulation WM that can interact with advanced VLA policies in imagination. 🧵1/6
VLAs are great, but most lack the long-term memory humans use for everyday tasks. This is a critical gap for solving complex, long-horizon problems.
Introducing MemER: Scaling Up Memory for Robot Control via Experience Retrieval.
A thread 🧵 (1/8)
We design 3 tasks that entail using memory in distinct ways, including recalling object locations, keeping track of completed actions, and counting repeated steps. MemER significantly outperforms baselines that naively scale history, while retaining low-latency inference.
(4/5)
Long histories are computationally expensive, while naive subsampling misses crucial context.
We train a high-level policy to select & track relevant past keyframes from its experience, and condition on this memory to generate a subtask for a low-level policy to execute.
(2/5)
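To make the control flow described above concrete, here is a minimal Python sketch of a two-level loop with a keyframe memory. It is an illustration under stated assumptions, not MemER's implementation: `KeyframeMemory`, `run_episode`, `toy_high_level`, `toy_low_level`, the `window` and `capacity` parameters, and the "keep every fifth frame" selection rule are all hypothetical stand-ins, and frames are plain integers rather than images. In the real system both levels are learned models.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Sequence, Tuple

Frame = int  # hypothetical stand-in for an image; here just a timestep index


@dataclass
class KeyframeMemory:
    """Bounded store of frames the high-level policy chose to remember."""
    capacity: int = 8
    frames: List[Frame] = field(default_factory=list)

    def add(self, selected: Sequence[Frame]) -> None:
        for f in selected:
            if f not in self.frames:  # skip frames already remembered
                self.frames.append(f)
        # Keep only the most recently added `capacity` keyframes.
        self.frames = self.frames[-self.capacity:]


def run_episode(
    observations: Iterable[Frame],
    high_level: Callable[[List[Frame], List[Frame]], Tuple[str, List[Frame]]],
    low_level: Callable[[Frame, str], Tuple[Frame, str]],
    window: int = 4,
) -> Tuple[List[Tuple[Frame, str]], List[Frame]]:
    memory = KeyframeMemory()
    recent: List[Frame] = []
    actions = []
    for obs in observations:
        recent = (recent + [obs])[-window:]  # short sliding window of frames
        # High level sees recent frames plus remembered keyframes and returns
        # a subtask together with the frames it wants to keep.
        subtask, keep = high_level(recent, list(memory.frames))
        memory.add(keep)
        # Low level only sees the current observation and the subtask.
        actions.append(low_level(obs, subtask))
    return actions, memory.frames


def toy_high_level(recent, remembered):
    # Assumed selection rule for illustration: every fifth frame is "important".
    keep = [f for f in recent if f % 5 == 0]
    subtask = "search" if not remembered else "retrieve"
    return subtask, keep


def toy_low_level(obs, subtask):
    return (obs, subtask)  # a real policy would output motor commands


actions, remembered = run_episode(range(12), toy_high_level, toy_low_level)
print(remembered)
```

The point of the sketch is the cost profile: the high-level step conditions on a short window plus a small, bounded set of keyframes, so its input size does not grow with episode length the way a full history would.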