Maxim Bobrin @maxsbob21 - Twitter Profile

Maxim Bobrin @maxsbob21

6 days ago

Lots of other interesting insights came out of this work. Thanks @machinestein, and see everyone who is interested in Seoul!

Arip

@machinestein

6 days ago

ICML 2026: Zero-Shot Off-Policy Learning

Distribution ratios play a central role in off-policy reinforcement learning. We show that, using only one behavioral foundation model, you can obtain its ratio to other policies for free, without any importance sampling or min-max optimization (as in DICE algorithms). We show that forward-backward representations store this ratio and can be used for better zero-shot adaptation, solving off-policy issues on the fly.

Maxim Bobrin @maxsbob21

8 days ago

@zhaisf @geoffreyhinton Even without reading the slides, the most probable explanation is that MNIST contains all the features required for predicting 3 from the other digits (some of which look similar to 3). It would be more interesting to take the least representative digit and check it OOD.

Maxim Bobrin @maxsbob21

18 days ago

@its_vayishu Am I correct that you store something like a latent buffer that acts as a memory, and that the predictive model outputs next states based on this dynamic buffer? And you show that some structure arises from the memory?

Maxim Bobrin @maxsbob21

28 days ago

@willccbb How is this different from Unsupervised Environment Design (UED)? In RL there are already lots of papers, and I assume some folks have already managed to apply ideas from UED. It seems like this paper reports the same findings, albeit from another perspective.

Maxim Bobrin @maxsbob21

2 months ago

@Amank1412 Most people would say that they are optimizing their work, but in reality they will spend the extra free time doomscrolling.

Maxim Bobrin @maxsbob21

3 months ago

@heyrimsha Can we please stop already? There are a lot of services for these kinds of things… The research community needs to heal ASAP.

Maxim Bobrin @maxsbob21

3 months ago

@daniel_mac8 They tested this on the public games. The private ones are much harder.

Maxim Bobrin @maxsbob21

3 months ago

@agenticasdk gg

Maxim Bobrin @maxsbob21

3 months ago

@KeyTryer You can check several replays and see that Gemini is the only model that tries to understand what to do through its reasoning traces, while the other models just return an action.

Maxim Bobrin @maxsbob21

3 months ago

@loveofdoing How did you come up with this approach? Some kind of blog post would be helpful.

Maxim Bobrin @maxsbob21

3 months ago

@ChenTessler Seems like there is no method that can do this even in simulation without spending hours on RL only for this single level

Maxim Bobrin @maxsbob21

3 months ago

@pengzhangzhi1 @Anru_Zhang @AlexanderTong7 Where can we find the recordings?

Maxim Bobrin @maxsbob21

3 months ago

@itsolelehmann What is 56%? How is it measured? The agent can basically just find some adversarial solution, get an improvement on a metric that it chose by itself, and you will never notice. dafuq?

Maxim Bobrin @maxsbob21

3 months ago

@mikeknoop But what about the NetHack Challenge? It measures the same skills that ARC-AGI-3 aims to check.

Maxim Bobrin @maxsbob21

3 months ago

@deliprao @karpathy As far as I understand, the program.md needs to be different for other domains (e.g., for RL/robotics).

maxsbob21 retweeted

Arip

@machinestein

3 months ago

Zero-Shot Off-Policy Learning

Behavioral foundation models are pretrained on large, reward-free transition datasets. At deployment time, they can be "prompted" to infer a policy for a new reward in a zero-shot manner, without any fine-tuning. This falls under offline or off-policy RL: once the inferred policy is executed, its state-action visitation may diverge from the dataset, leading to distribution shift, value overestimation, and other typical off-policy issues. The missing ingredient is a principled off-policy correction, specifically stationary occupancy (density-ratio) correction. In this paper, we show that by using Forward–Backward successor representations, this density-ratio correction can also be performed in a zero-shot manner!

Paper: https://t.co/6myZI8G2Ty

Code: https://t.co/JFZ3fybmBe

Maxim Bobrin @maxsbob21

6 months ago

@ChenTessler Is this a sample video replicating some motion from the AMASS training set? If so, how was it prompted at inference time? Or was the agent trained with this particular option only?

maxsbob21 retweeted

Arip

@machinestein

6 months ago

While we are going back to the era of research…
Introducing Deep Improvement Supervision (DIS) – a new learning method for recursive reasoning.

DIS builds on the elegant Tiny Recursive Model (TRM) (@jm_alexia) but makes recursion radically simpler:
- 18× fewer forward passes
- No halting mechanism
- And a tiny 0.8M-parameter model reaching 24% accuracy on ARC-AGI-1 (@arcprize)

Paper: https://t.co/QM6hNFMm5M

Code: https://t.co/d4nhzvBz4G

maxsbob21 retweeted

Ilya Zisman @suessmannn

about 1 year ago

🔥 Zero-shot generalization is the dream: adapt instantly, no fine-tuning. It's why LLMs blew up, but it's not just a language modeling thing. It's happening in RL too.

🚨 @maxsbob21's new paper dives deep into zero-shot RL under shifting dynamics, and why current methods break. https://t.co/hdWGz3lZ60
