Alexey Gorbatovski @AMyashka - Twitter Profile

about 16 hours ago

@sheriyuo Nice work! Cool to see the field converging. We posted a related angle, TRB, just before, via rollout-behavior blending: https://t.co/ButztdPeoF

Alexey Gorbatovski @AMyashka

2 days ago

[1/7] OPD has a simple post-training loop: sample from the student, label with the teacher, repeat. The awkward part is the start. The first rollouts come from the weakest version of the student, and training begins there. TRB Paper: https://t.co/EPjiHZIE3s

5

109

23

103

10K

0

1

0

1

209

Alexey Gorbatovski @AMyashka

1 day ago

@amit_1992 Fair. We don’t derive a universal teacher-size cutoff. Appendix C makes the tradeoff explicit: TRB adds co-residency (~teacher weights + KV) and swaps batched teacher scoring for online decoding. Break-even is hardware/batching/warmup-fraction dependent.

0

2

1

0

98

Alexey Gorbatovski @AMyashka

2 days ago

[7/7] Trust-Region Behavior Blending for On-Policy Distillation (https://t.co/TPX3YEpy6D) Thanks @Daniil_Plyusov @nlp_ceo @a_malakhov11 @borisshapa @kefirski cc @agarwl_ @donglixp @WendaXu2 @SongHan_MIT @qcwntu @lvwerra @_lewtun @_akhaliq

1

4

1

2

298

Alexey Gorbatovski @AMyashka

2 days ago

[6/7] The diagnostics match that story. During warmup, TRB reaches prefixes where the teacher's token-mean entropy is lower. After warmup, the distribution mostly matches vanilla OPD, but the benchmark curve stays higher. The gain seems to come from the warm start. https://t.co/dAgu4ofzA0

1

0

317

AMyashka retweeted

Boris Shaposhnikov @borisshapa

about 1 month ago

@icmlconf ESSA #23562 was rejected, but the decision says “the paper’s strengths outweigh its weaknesses.” Reviewers raised scores after rebuttal; requested additions read like camera-ready changes. Possible inconsistency? https://t.co/9goUHBeBEv

0

3

2

0

699

AMyashka retweeted

Daniil Gavrilov @kefirski

about 2 months ago

The biggest threat to the #ICML2026 AC discussion phase is these issues with Claude. I feel like the gap between the A and B cohorts in reviewing will become even larger, and the whole conference is going to be filled with LLM slop. The research we deserved.

0

6

2

0

525

Alexey Gorbatovski @AMyashka

2 months ago

🚀I built an interactive game to learn Claude Code extensions. 🙏 Built on top of @official_taches's excellent resource list: https://t.co/Kg6EdI1xGs Try it: https://t.co/vz32hgHaJC Source: https://t.co/SpTvzXLuI9

0

6

1

146

Alexey Gorbatovski @AMyashka

2 months ago

@HeyZaraKhan Yo, I just shipped this today - might be a good fit for this list too 👀 https://t.co/dEGU3zUJoC

0

1

0

264

AMyashka retweeted

George Bredis @BredisGeorge

3 months ago

Happy to announce that our paper “Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success” has been accepted to #AAMAS 🎉 We introduce VL-DAC, a framework that leverages synthetic environments + RL to improve real-world VLM performance. 🔧 Code: https://t.co/c9gX65o9Gn 🌐 Project page: https://t.co/iE9IassabN

1

5

2

0

414

AMyashka retweeted

Viacheslav Sinii @ummagumm_a

3 months ago

1/ 🧵 Reproducing Anthropic’s “counting manifold” result in open-weight LLMs: do they internally track “chars since last \n” to wrap text consistently? https://t.co/me60hJfrxN https://t.co/3qUNRH34Kk

4

219

31

144

24K

Alexey Gorbatovski @AMyashka

4 months ago

@HuggingPapers Appreciate the feature! I posted a short thread with 3 GIFs explaining the tail-miss mechanism + why risk peaks at intermediate N, and the 1-line fix (F-GRPO).

0

38

Alexey Gorbatovski @AMyashka

4 months ago

[8/8] Paper: F-GRPO: Don’t Let Your Policy Learn the Obvious and Forget the Rare (https://t.co/vEcKfJ9RKF) Thanks @borisshapa @ummagumm_a @a_malakhov11 @kefirski cc @shizhediao @winniethexu @archit_sharma97 @aviral_kumar2 @srush_nlp @natolambert @lvwerra @_lewtun @_akhaliq

0

90

Alexey Gorbatovski @AMyashka

4 months ago

[1/8] RL with verifiable rewards often boosts pass@1 but hurts pass@256. In group-relative RL with binary rewards, this isn’t “mysterious collapse” — it’s a sampling effect. We analyze why + propose a 1-line fix: F-GRPO (focal weighting). Paper: https://t.co/vEcKfJ9RKF

1

6

0

579

Alexey Gorbatovski @AMyashka

4 months ago

[7/8] Efficiency: F-GRPO at N=8 matches GRPO at N=32 on diversity with 4× fewer rollouts. Math pass@256: 70.3 vs 70.1. OOD pass@256: 63.3 vs 61.7. Also works as F-DAPO and F-CISPO (consistent pass@256 gains).

1

0

80
