Albert Ge @albert_ge_95 - Twitter Profile

Pinned Tweet

8 months ago

🔭 Towards Extending Open dLLMs to 131k Tokens
dLLMs behave differently from autoregressive models: they lack attention sinks, making long-context extension tricky.
A few simple tweaks go a long way!!
✍️blog https://t.co/Epf2y2Lnsk
💻code https://t.co/c04Cj5iT1y

5

198

48

118

20K

albert_ge_95 retweeted

Gabe Orlanski

@GOrlanski

about 2 months ago

Very excited to announce the v1.0 release of SlopCodeBench:
- Doubling the size of the dataset
- @harborframework support
- scb-check: a CLI that flags slop anti-patterns
- Way more model results

https://t.co/RQkB8wdzAu https://t.co/36qQR3azeE 🧵

2

64

10

17

14K

albert_ge_95 retweeted

Vasilis Kontonis @vkontonis

2 months ago

Excited to share our new work with an amazing team at @MSFTResearch @yzeng58 @ShivamGarg91462 @ChenLingjiao @tanghao95 @ZiyanWang98 @AhmedHAwadallah @erichorvitz @JohnCLangford @DimitrisPapail

2

143

26

110

25K

Albert Ge @albert_ge_95

2 months ago

pretty cool ideas, i was wondering if there's a good interplay btwn fixed state and dynamically growing state

Ali Behrouz

@behrouz_ali

2 months ago

The growing KV-cache of attention is the key component for the long-context understanding of LLMs, but what holds back long-term memory modules (e.g., Titans)? What if we could have the compression power of Titans but with a growing memory similar to Transformers?

Memory Caching: A class of architectures that compress the context into a slowly growing memory (not as fast as Transformers, but not as static as RNNs), resulting in recurrent neural networks with non-fixed-size memory (hidden states). Building on this formulation, we present Sparse Selective Caching, an architecture with growing effective memory (similar to attention) but with almost constant inference cost per token (similar to RNNs).

24

1K

167

786

109K

0

10

0

2

2K

Albert Ge @albert_ge_95

2 months ago

interesting ideas, cc @zhu_xuekai

Sharon Li

@SharonYixuanLi

2 months ago

We've been in GRPO-tweaking mode for months (entropy bonuses, clipping hacks, length penalties). But what if the entire objective is wrong?

Today, we're releasing LAD (Learning Advantage Distributions), the most elegant rethink of RL for LLM reasoning I've seen this year. #ACL2026

Here's the idea, how it works, and why we think it changes things. 🧵

The problem we kept hitting
GRPO, DAPO, RLOO, and many other variants do the same thing at their core: maximize expected reward. And when you do that, your policy can collapse onto a single dominant reasoning path. Entropy regularization can be bolted onto the framework, but it doesn't fix the problem from the ground up.

The key insight
💡Stop maximizing. Start matching.

We reframe the policy update as a distribution matching problem. Instead of pushing toward the single best response, we make the policy's output distribution match the full advantage-weighted target distribution by minimizing an f-divergence between the two (see our theory in Section 3.1).

When you match the full advantage distribution, you naturally preserve probability mass across multiple valid reasoning paths. High-advantage responses get upweighted, yes, but the objective also suppresses overconfident probability growth on any single mode.

Collapse prevention isn't an afterthought.

What validated the theory
We tested six divergence families. The result that convinced us we were on the right track:
- Strict divergences (Total Variation, Hellinger, Jensen-Shannon) that enforce exact distributional matching consistently outperform weaker ones (such as KL).
- The more faithfully you learn the full advantage distribution, the better the reasoning. This is exactly what the framework predicts.

The results
- In a controlled bandit setting, LAD recovers multiple-mode advantage distributions (see plot below). GRPO fundamentally cannot. This is the clearest demonstration that the paradigm difference is real, not just theoretical.
- In math and code reasoning tasks across multiple LLM backbones, LAD consistently outperforms GRPO on both accuracy AND generative diversity across benchmarks.

Why this matters beyond benchmarks
Pass@k scaling: If your model knows 5 valid reasoning paths instead of 1, sampling at inference becomes massively more effective.

Simplicity: Instead of stacking "GRPO + entropy hack," you get one principled objective. Diversity preservation comes by design.

Paper: https://t.co/Vs8TpzjiGH
Code is available; link in the paper.

Huge credit to my amazing student @Wendi_Li_, who drove this work, thinks boldly, and made things happen.

7

376

48

339

32K

1

5

0

10

3K

Albert Ge @albert_ge_95

2 months ago

@dyahadila_ rest easy boss

0

1

0

17

albert_ge_95 retweeted

Chandan Singh @csinva

2 months ago

Here's a short skill that's helped me get clearer technical reports (especially visualizations) from agents: https://t.co/SOoC35Bi6v

0

16

3

8

918

albert_ge_95 retweeted

Aniket Rege @wregss

2 months ago

🎉 Super stoked to share that our work is accepted to the main conference of ACL 2026!! See you in sunny San Diego 🌞 #ACL2026 #NLProc Paper thread below 🧵

1

53

4

2

2K

Albert Ge @albert_ge_95

2 months ago

new scaling law work - predicts a very recently trained model with high precision!

Nicholas Roberts

@nick11roberts

2 months ago

That new LFM2.5-350M is super overtrained, right? And everyone was shocked about how far they pushed it? As it turns out, we have a brand new scaling law for that! 🧵 [1/n]

11

362

53

305

68K

0

22

1

9

4K

Albert Ge @albert_ge_95

3 months ago

whoa new scaling law theory predicted a future model?? you should follow @nick11roberts, he's got more to say very soon!!

Nicholas Roberts

@nick11roberts

3 months ago

The Chinchilla is dead, long live the ___!

4

193

27

121

49K

0

21

2

11

3K

Albert Ge @albert_ge_95

3 months ago

had fun working with gabe on evaluating AI-generated code! check out how we measure performance on long-context coding tasks!

Gabe Orlanski

@GOrlanski

3 months ago

We found that agents generate progressively worse code with each iteration. Real developers do not. SlopCodeBench is the only eval that faithfully measures quality degradation on iterative, long-horizon coding tasks. https://t.co/JXGHC4w0bv https://t.co/RQkB8wdzAu 🧵

44

724

100

507

184K

1

8

0

1

610

albert_ge_95 retweeted

Shawn Park

@hynwprk

4 months ago

Here's how I solved the Jane Street puzzle from the @dwarkesh_sp podcast!

11

151

8

99

7K

Albert Ge @albert_ge_95

3 months ago

wait a sec... do i have to pay a 3x premium just to turn on speculative decoding??

Cursor @cursor_ai

3 months ago

It's frontier-level at coding, priced at:
- Standard: $0.50/M input and $2.50/M output
- Fast: $1.50/M input and $7.50/M output
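The "3x premium" in the reply above can be checked from the quoted prices. A minimal sketch: the token counts below are a hypothetical workload chosen only for illustration, and the source does not say that the Fast tier uses speculative decoding.

```python
# Prices quoted in the tweet above, in USD per million tokens.
standard = {"input": 0.50, "output": 2.50}
fast = {"input": 1.50, "output": 7.50}

# Price multiplier of the Fast tier over Standard, per token kind.
multipliers = {kind: fast[kind] / standard[kind] for kind in standard}

# Hypothetical workload (an assumption, not from the source).
input_tokens, output_tokens = 2_000_000, 500_000

def cost(prices):
    """Total cost in USD for the hypothetical workload."""
    return (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000

print(multipliers)
print(cost(standard), cost(fast))
```

Because both input and output prices scale by the same factor, the total bill scales by that factor regardless of the input/output mix.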

18

1K

33

98

189K

0

9

0

1

1K

Albert Ge @albert_ge_95

3 months ago

aniket presented some cool work recently on long-form video understanding in our reading seminar, please check out his profile!

Aniket Rege @wregss

3 months ago

Hi ML Twitter! My Summer 2026 internship unfortunately fell through last minute 😵‍💫 If your team is looking for interns, I’d love to connect - RTs appreciated 🙏 My website: https://t.co/rNih6t6Emb

17

266

29

69

35K

1

4

0

624

Albert Ge @albert_ge_95

3 months ago

@Avanika15 will do, thanks for the invite!

0

24

Albert Ge @albert_ge_95

3 months ago

always wanted to give claw devices a try, excited to see open source development in this area!

Jon Saad-Falcon

@JonSaadFalcon

3 months ago

Personal AI should run on your personal devices. So, we built OpenJarvis: a personal AI that lives, learns, and works on-device. Try it today and top the OpenJarvis Leaderboard for a chance to win a Mac Mini! Collab w/ @Avanika15, John Hennessy, @HazyResearch, and @Azaliamirh. Details in thread.

38

326

89

231

107K

1

7

0

554

Albert Ge @albert_ge_95

3 months ago

gabe has been doing a lot of detailed analysis and documentation of failure modes of frontier models, check out more below!

Gabe Orlanski

@GOrlanski

3 months ago

https://t.co/IaQjm6xoWD

0

22

7

10

5K

0

5

1

4

910

Albert Ge @albert_ge_95

3 months ago

yes, you can use RL to scale image labelling and data collection! check out brian's new work in this area!

Tzu-Heng (Brian) Huang @zihengh1

3 months ago

Scaling expert-annotated image captions is expensive. Supervised distillation from VLMs helps but has a diversity ceiling: models memorize the teacher's style and generalize poorly. Can RL fix this without a verifiable "ground truth"? Introducing RubiCap: https://t.co/wRnZcIm4xf

2

50

18

21

12K

0

9

2

5

2K

Albert Ge @albert_ge_95

3 months ago

cool work in the continual learning area! train a second MLP on the side and still get good performance while minimizing forgetting!

Dyah Adila 🦄 @dyahadila_

3 months ago

🌟 Psyched to finally share our paper from my internship w/ Google Research last summer: "Grow, Don't Overwrite: Fine-tuning Without Forgetting" A very simple method that matches full fine-tuning on new tasks with almost zero forgetting. 📄https://t.co/XeoxSBoJOY 🧵 below

3

101

24

46

9K

1

12

0

6

2K

albert_ge_95 retweeted

Ian Li

@IanLi1118

3 months ago

One of the biggest promises of Diffusion LLMs is parallel generation: predicting multiple tokens at once to bypass the sequential bottleneck of autoregressive models.

However, parallel generation comes with a price. For example:

Should the sentence “He is from [MASK] [MASK]” be filled with [New] [York] or [San] [Diego]?

If a diffusion model predicts both at the exact same time, it assumes independence and may produce... [San] [York]. 🤦‍♂️

We argue this arises from a structural misspecification: models are restricted to fully factorized outputs because parameterizing the full joint distribution would require a prohibitively massive output head.

This is the Factorization Barrier crippling parallel generation. Here is how we broke it with CoDD.

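The [San] [York] failure described in the tweet can be reproduced with a toy calculation. The sketch below is not CoDD and not any model's actual output head; it assumes a hypothetical joint distribution in which only "New York" and "San Diego" are valid, each with probability 0.5, and compares it with the product of its per-position marginals.

```python
from itertools import product

# Hypothetical toy joint distribution over the two masked positions
# (an assumption for illustration, not data from the paper).
joint = {("New", "York"): 0.5, ("San", "Diego"): 0.5}

def marginals(joint):
    """Per-position marginals: all that a fully factorized output can express."""
    first, second = {}, {}
    for (a, b), p in joint.items():
        first[a] = first.get(a, 0.0) + p
        second[b] = second.get(b, 0.0) + p
    return first, second

def factorized(joint):
    """Product of the marginals, i.e. the two tokens treated as independent."""
    first, second = marginals(joint)
    return {(a, b): pa * pb
            for (a, pa), (b, pb) in product(first.items(), second.items())}

approx = factorized(joint)
# Probability mass the factorized distribution puts on pairs the joint forbids.
invalid_mass = sum(p for pair, p in approx.items() if pair not in joint)

for pair, p in sorted(approx.items()):
    print(pair, p)
print("mass on invalid pairs:", invalid_mass)
```

In this toy case the marginals are exactly right at each position, yet half of the factorized probability mass lands on mixed pairs such as "San York" and "New Diego".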

8

311

30

211

23K

albert_ge_95 retweeted

Kangwook Lee

@Kangwook_Lee

3 months ago

https://t.co/6beJPutjiG

39

3K

304

6K

1M
