In RL, what if we could learn reliably and scalably from any experience collected by any policy?
This would help in domains like robotics, where new data is expensive.
We introduce Long-horizon Q-learning (LQL) to tackle this: https://t.co/1Ckb5ZePyo
We scale LQL's trajectory length L on humanoidmaze-giant, OGBench's longest task (thousands of steps, single sparse reward):
1-step TD: 0%
n-step TD: peaks at n=4 (38.4%), drops to 6.1% at n=64
LQL (L=64): 75.7%
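
For readers unfamiliar with the baselines, here is a minimal sketch of the textbook n-step TD target that the 1-step and n-step rows refer to. It is not LQL and not the authors' code. The function name `n_step_td_target` and its arguments are hypothetical, and `bootstrap_value` stands in for whatever value estimate a learner would plug in at the state reached after n steps.

```python
def n_step_td_target(rewards, bootstrap_value, gamma=0.99, n=1):
    """Textbook n-step TD target for the first state of a trajectory snippet.

    rewards:         r_0 ... r_{n-1}, the rewards observed along the snippet
    bootstrap_value: value estimate at the state reached after n steps
                     (hypothetical input, e.g. max_a Q(s_n, a))
    gamma:           discount factor
    n:               number of real reward steps used before bootstrapping
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(rewards) < n:
        raise ValueError("need at least n rewards")

    # Discounted sum of the first n observed rewards.
    target = 0.0
    for k in range(n):
        target += (gamma ** k) * rewards[k]

    # Bootstrap from the value estimate n steps ahead.
    return target + (gamma ** n) * bootstrap_value


# Toy sparse-reward snippet: the only reward arrives on the 4th step.
rewards = [0.0, 0.0, 0.0, 1.0]
one_step = n_step_td_target(rewards, bootstrap_value=0.0, gamma=0.5, n=1)
four_step = n_step_td_target(rewards, bootstrap_value=0.0, gamma=0.5, n=4)
```

With n=1 the target depends entirely on the bootstrapped estimate, so a sparse reward has to propagate back one step at a time. Larger n carries the reward back faster, but the summed rewards come from whichever policy collected the data.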