(1/n)
With over 1,300 citations, MBPO is often cited as proof that model-based RL beats model-free methods. In https://t.co/xq3WXslh67 we showed it often completely fails in DeepMind Control. In our new work, Fixing That Free Lunch (FTFL), we explain why and make it succeed.
(12/12)
In summary, FTFL turns MBPO’s synthetic-data failures into successes and shows how even seemingly similar environment structure can shape algorithmic reliability.
Full paper: https://t.co/UjkUVmvg42
(11/n)
FTFL shows that understanding when and why algorithms fail is as important as improving their averages. We hope this motivates the RL community to build mappings between environment structure and algorithmic choices as a step toward more generally reliable methods.
@GuanyaShi @Caltech @lschmidt3 Totally resonates with our work (arXiv:2412.14312): we show that Dyna-style tweaks - dominant in Gym - consistently hurt performance in DMC despite both using MuJoCo. Adding them to off-policy methods makes things worse, not better. Maybe we’ve overfit to Gym more than we realized.
(10/10) In summary, OpenAI Gym and DMC are equally conventional testbeds that share a common physics backend (MuJoCo). There is no 'good' reason for MBPO and ALM to largely fail in DMC, but they do. We encourage readers to check out our paper for more:
https://t.co/xq3WXskJgz
You might be surprised to learn that modern RL favors Dyna-style model-based algorithms for their sample efficiency, yet they can both require up to 40x more wall-clock time to train and significantly underperform simple model-free methods across diverse benchmarks.
(9/n) Not only that, but at the time of this post MBPO has >1000 citations and a reproducibility study at NeurIPS. Despite this, only one paper has noted this performance gap, and only across the Hopper tasks in Gym and DMC.