ICML 2026: Zero-Shot Off-Policy Learning
Distribution ratios play a central role in off-policy reinforcement learning.
We show that with a single behavioral foundation model, you can obtain the distribution ratio between its data and other policies for free, without any importance sampling or min-max optimization (as in DICE algorithms).
We show that forward-backward representations store this ratio and can be used for better zero-shot adaptation, solving off-policy issues on the fly.
@zhaisf @geoffreyhinton Even without reading the slides, the most probable explanation is that MNIST contains all the features required for predicting 3 from the other digits (some of which look similar to 3). It would be more interesting to take the least representative digit and check OOD
@its_vayishu Am I correct that you store something like a latent buffer that acts as a memory, and that the predictive model outputs next states based on this dynamic buffer? And you show that there is some structure arising from the memory?
@willccbb How is this different from Unsupervised Environment Design? In RL there are already lots of papers, and I assume some folks have already managed to apply ideas from UED. Seems like this paper reports the same findings, just from another perspective
@KeyTryer You can check several replays and see that Gemini is the only model which tries to understand what to do through reasoning traces, while the other models just return an action
@itsolelehmann What is 56%? How is it measured? The agent can basically just find some adversarial solution, get an improvement on a metric that it chose itself, and you will never notice. dafuq?
Zero-Shot Off-Policy Learning
Behavioral foundation models are pretrained on large, reward-free transition datasets. At deployment time, they can be "prompted" to infer a policy for a new reward in a zero-shot manner, without any fine-tuning.
This falls under offline or off-policy RL: once the inferred policy is executed, its state-action visitation may diverge from the dataset, leading to distribution shift, value overestimation, and other typical off-policy issues.
The missing ingredient is a principled off-policy correction, specifically stationary occupancy (density-ratio) correction.
In this paper, we show that by using Forward-Backward successor representations, this density-ratio correction can also be performed in a zero-shot manner!
Paper: https://t.co/6myZI8G2Ty
Code: https://t.co/JFZ3fybmBe
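To make the idea concrete, here is a minimal sketch and not the paper's implementation. It assumes the standard forward-backward factorization of the successor measure, M(s0, ds') ≈ F(s0)ᵀ B(s') ρ(ds'), with the unnormalized convention M = Σ_t γ^t P^t. Under that assumption, the reward prompt is z = E_ρ[r(s) B(s)] and the occupancy ratio is d(s')/ρ(s') ≈ (1 - γ) E[F(s0)]ᵀ B(s'). The function names and the 3-state toy chain are hypothetical, chosen so that an exact factorization is known and the result can be checked.

```python
import numpy as np

def infer_task_vector(B_data, rewards, weights=None):
    """Zero-shot reward prompt: z = E_rho[r(s) * B(s)] (hypothetical helper)."""
    return np.average(rewards[:, None] * B_data, axis=0, weights=weights)

def occupancy_ratio(F_init, B_states, gamma):
    """Ratio d(s') / rho(s') ~= (1 - gamma) * E[F(s0)]^T B(s') (hypothetical helper)."""
    return (1.0 - gamma) * (B_states @ F_init.mean(axis=0))

# Toy 3-state chain with a fixed policy, so F does not depend on z here.
gamma = 0.9
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])          # policy-induced state transitions
rho = np.array([0.5, 0.3, 0.2])          # dataset state distribution
r = np.array([0.0, 0.0, 1.0])            # new reward given at test time

# Exact factorization for the toy case: F(s) is row s of M, B(s') = e_{s'} / rho(s'),
# so that F(s)^T B(s') rho(s') = M[s, s'].
M = np.linalg.inv(np.eye(3) - gamma * P)
F = M
B = np.diag(1.0 / rho)

z = infer_task_vector(B, r, weights=rho)  # expectation under rho, computed exactly
values = F @ z                            # value of each state for reward r

F_init = F[[0]]                           # every episode starts in state 0
ratio = occupancy_ratio(F_init, B, gamma)
occupancy = ratio * rho                   # discounted state occupancy of the policy
```

In this toy setting the ratio is exact; with learned F and B it is only as good as the learned factorization.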
@ChenTessler Is this a sample video replicating some motion from the AMASS training set? If so, how was it prompted at inference time? Or was the agent trained with this particular option only?
While we are going back to the era of research…
Introducing Deep Improvement Supervision (DIS), a new learning method for recursive reasoning.
DIS builds on the elegant Tiny Recursive Model (TRM)(@jm_alexia) but makes recursion radically simpler:
- [two-digit factor lost]× fewer forward passes
- No halting mechanism
- And a tiny 0.8M-parameter model reaching 24% accuracy on ARC-AGI-1 (@arcprize)
Paper: https://t.co/QM6hNFMm5M
Code: https://t.co/d4nhzvBz4G
🔥 Zero-shot generalization is the dream: adapt instantly, no fine-tuning. It's why LLMs blew up, but it's not just a language modeling thing. It's happening in RL too.
🚨 @maxsbob21's new paper dives deep into zero-shot RL under shifting dynamics, and why current methods break.