Yupei Du @YupeiDu - Twitter Profile

Pinned Tweet

7 months ago

How do language models memorize noise while reason impressively well? Our #EMNLP2025 (poster, Nov 5, 11:00-12:30, Hall C) paper shows that memorization reuses internal mechanisms of generalization, even when they are not related to each other! https://t.co/LD2OJmZBm5

YupeiDu's tweet photo. How do language models memorize noise while reason impressively well?

Our #EMNLP2025 (poster, Nov 5, 11:00-12:30, Hall C) paper shows that memorization reuses internal mechanisms of generalization, even when they are not related to each other!

https://t.co/LD2OJmZBm5 https://t.co/z5xwUcIgSU

1

25

6

2K

YupeiDu retweeted

Nando de Freitas

@NandoDF

2 days ago

End to end learning — We chose no shortcuts so we could learn, and build the knowledge and infrastructure to create many more models in years to come.

7

107

11

38

10K

YupeiDu retweeted

OpenAI

@OpenAI

18 days ago

Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946. For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids. An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better. This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.

1K

27K

4K

9K

14M

YupeiDu retweeted

elie

@eliebakouch

about 1 month ago

Deepseek V4 Pro is the biggest open model ever with 1.6T total 49B active, trained on 33T tokens, 1M context, with 2 new attention mechanisms, Muon, mHC, open source kernels, FP4 QAT, MIT license and with one of the best tech repot of the year

eliebakouch's tweet photo. Deepseek V4 Pro is the biggest open model ever with 1.6T total 49B active, trained on 33T tokens, 1M context, with 2 new attention mechanisms, Muon, mHC, open source kernels, FP4 QAT, MIT license and with one of the best tech repot of the year https://t.co/szEkdvCKKy

20

1K

166

478

133K

Who to follow

Xiang Yue

@xiangyue96

Superintelligence @Meta. Training & evaluating foundation models. Previously @LTIatCMU @osunlp. Opinions are my own.

Simone Balloccu

@simoneballoccu

(he/him) ExpNLP lab leader @TUDarmstadt. Researching AI w.r.t human evaluation, behaviour change, safety and controllability, expert domains. Opnions my own.

Zhenwen Liang

@LiangZhenwen

Math, Coding Agent Post Training Scientist in LLM, at Miromind, Seattle. Previous at Salesforce AI Research, AI2 and Tencent AI Lab.

YupeiDu retweeted

Hayden Prairie @hayden_prairie

about 2 months ago

We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters. Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size. Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem! 🧵👇

hayden_prairie's tweet photo. We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters.

Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size.

Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem!

🧵👇

41

1K

179

1K

294K

YupeiDu retweeted

Yuekun Yao @yuekun_yao

about 2 months ago

Claude Mythos is suspected of being a Looped transformer (LT), but why are LT-based LLMs so powerful? Our new finding: LT can perform implicit reasoning over their parametric knowledge, unlocking generalization to complex and unfamiliar questions compared to transformers ⤵️

yuekun_yao's tweet photo. Claude Mythos is suspected of being a Looped transformer (LT), but why are LT-based LLMs so powerful?

Our new finding: LT can perform implicit reasoning over their parametric knowledge, unlocking generalization to complex and unfamiliar questions compared to transformers ⤵️ https://t.co/FQuraEuEk9

16

966

155

997

187K

YupeiDu retweeted

Michael Hahn @mhahn29

2 months ago

We have 1-2 more extra spots due to new funding -- apply by end of April!

5

161

31

102

28K

YupeiDu retweeted

Michael Hahn @mhahn29

3 months ago

Excited about this advance in Transformer theory, giving us detailed understanding of when Transformers generalize at symbolic reasoning (and when not).

0

57

10

33

7K

YupeiDu retweeted

John Schulman

@johnschulman2

3 months ago

Models that are great at calibrated predictions will be transformative for decision making. Excited about Mantic's work and proud they're using Tinker. Their new blog post digs into their methodology and findings.

14

403

33

155

110K

YupeiDu retweeted

Yilong Chen

@Yichen4NLP

3 months ago

We introduce MoUE. A new MoE paradigm boosts base-model performance by up to 1.3 points from scratch and up to 4.2 points on average, without increasing either activated parameters or total parameters. The main idea is simple: a sufficiently wide MoE layer with recursive reuse can be treated as a strict generalization of standard MoE. https://t.co/UTagOXUD0y https://t.co/LsnL5GEIaX #MoE #LLM #MixtureOfExperts #SparseModels #ScalingLaws #Modularity #UniversalTransformers #RecursiveComputation #ContinualPretraining

6

113

16

110

33K

YupeiDu retweeted

Albert Gu

@_albertgu

4 months ago

one question we got a lot about H-Net is how it compares to MoE. the idea is that both of them can be seen as dynamic or sparse computation methods that can adjust the FLOPs-to-parameter ratio (in H-Net, via the chunking ratio; in MoE, via number of experts). in other words for a fixed FLOP budget, both methods can increase parameter count by sparsely activating parameters only on some tokens. in the original paper, we compared these while matching both FLOPs *and* parameters and showed that H-Net >> MoE on byte-level language modeling. of course, H-Nets can be applied to any data so an open question remained about whether it's still better than MoE when applied directly to standard tokens instead of bytes. this paper answers the question affirmatively: H-Net seems to still consistently outperform MoE in resource-matched settings! they show this for standard language modeling (on top of BPE tokens) as well as in multimodal (vision-language models). there are a lot of other interesting results on ablations inside the architecture here the results are cool, but the weirdest part of this paper is how hard it tried to avoid stating what they did plainly: it's literally H-Net on tokens. i think being more transparent would have helped rather than diminish the paper's results by making what they did more accessible to the community, whereas the way it is written is a bit confusing 🤷‍♂️

_albertgu's tweet photo. one question we got a lot about H-Net is how it compares to MoE. the idea is that both of them can be seen as dynamic or sparse computation methods that can adjust the FLOPs-to-parameter ratio (in H-Net, via the chunking ratio; in MoE, via number of experts). in other words for a fixed FLOP budget, both methods can increase parameter count by sparsely activating parameters only on some tokens.

in the original paper, we compared these while matching both FLOPs *and* parameters and showed that H-Net >> MoE on byte-level language modeling. of course, H-Nets can be applied to any data so an open question remained about whether it's still better than MoE when applied directly to standard tokens instead of bytes.

this paper answers the question affirmatively: H-Net seems to still consistently outperform MoE in resource-matched settings! they show this for standard language modeling (on top of BPE tokens) as well as in multimodal (vision-language models). there are a lot of other interesting results on ablations inside the architecture here

the results are cool, but the weirdest part of this paper is how hard it tried to avoid stating what they did plainly: it's literally H-Net on tokens. i think being more transparent would have helped rather than diminish the paper's results by making what they did more accessible to the community, whereas the way it is written is a bit confusing 🤷‍♂️

7

200

24

143

13K

YupeiDu retweeted

Karan Dalal

@karansdalal

5 months ago

LLM memory is considered one of the hardest problems in AI. All we have today are endless hacks and workarounds. But the root solution has always been right in front of us. Next-token prediction is already an effective compressor. We don’t need a radical new architecture. The missing piece is to continue training the model at test-time, using context as training data. Our full release of End-to-End Test-Time Training (TTT-E2E) with @NVIDIAAI, @AsteraInstitute, and @StanfordAILab is now available. Blog: https://t.co/woCpiIrq0T Arxiv: https://t.co/3VkFlS3wx3 This has been over a year in the making with @arnuvtandon and an incredible team.

karansdalal's tweet photo. LLM memory is considered one of the hardest problems in AI.

All we have today are endless hacks and workarounds. But the root solution has always been right in front of us.

Next-token prediction is already an effective compressor. We don’t need a radical new architecture. The missing piece is to continue training the model at test-time, using context as training data.

Our full release of End-to-End Test-Time Training (TTT-E2E) with @NVIDIAAI, @AsteraInstitute, and @StanfordAILab is now available.

Blog: https://t.co/woCpiIrq0T
Arxiv: https://t.co/3VkFlS3wx3

This has been over a year in the making with @arnuvtandon and an incredible team.

91

2K

321

2K

575K

YupeiDu retweeted

Shashwat Goel

@ShashwatGoel7

5 months ago

Excited to announce the OpenForecaster project, we train models at reasoning predict the future. We won't get to AGI by maxxing STEM exam and coding benchmarks. That's not what most humans reason about in their day to day. Instead, we reason about uncertainty to make decisions, using our world-model of how society evolves. Yet, there weren't any large-scale datasets to train AI for this form of reasoning. Until now. We release OpenForesight, a training dataset of 52k forecasting questions, made from global news. Our recipe is fully automated, and can be repeated for more, newer data at low cost. Using it, we RL trained an 8B model, and it became competitive with much larger models like GPT-OSS-120B across benchmarks and metrics. And we want to keep building on this, in public. Our paper with full details, dataset, code etc. in 🧵 Blog: https://t.co/UhxbxmM4Te

ShashwatGoel7's tweet photo. Excited to announce the OpenForecaster project, we train models at reasoning predict the future.

We won't get to AGI by maxxing STEM exam and coding benchmarks. That's not what most humans reason about in their day to day.

Instead, we reason about uncertainty to make decisions, using our world-model of how society evolves. Yet, there weren't any large-scale datasets to train AI for this form of reasoning. Until now.

We release OpenForesight, a training dataset of 52k forecasting questions, made from global news. Our recipe is fully automated, and can be repeated for more, newer data at low cost.

Using it, we RL trained an 8B model, and it became competitive with much larger models like GPT-OSS-120B across benchmarks and metrics.

And we want to keep building on this, in public. Our paper with full details, dataset, code etc. in 🧵

Blog: https://t.co/UhxbxmM4Te

20

387

53

257

45K

YupeiDu retweeted

Songlin Yang

@SonglinYang4

5 months ago

the residual stream should be viewed as a recurrence, and insights from the RNN literature should apply here

8

208

16

74

22K

YupeiDu retweeted

Zeyuan Allen-Zhu, Sc.D.

@ZeyuanAllenZhu

6 months ago

(1/N)🚀Today we launch two tightly connected milestones in the Physics of LM series: a sharpened Part 4.1 (v2.0) and a brand new Part 4.2 — together forming a clear, reproducible, textbook-style reference for principled architecture research. Part 4.1 introduced a synthetic pretraining playground — our Galileo experiment for LLMs🍎. Our v2.0 strengthens it with Gated DeltaNet (GDN) and stricter alignment, building an even cleaner “Pisa tower” for testing architectural limits. Part 4.2 shows these synthetic predictions resonate in reality 🌍 — across 1–8B / 1T-token pretraining — confirming which design principles actually matter. Together, Parts 4.1 and 4.2 bring the synthetic and real worlds into surprising agreement 🤝— one more step toward a more scientific understanding of LLM architectures. If you’re curious about: 🧠why some models reason deeper ⚙️ why linear models struggle at retrieval 🎶why a tiny horizontal mixer (Canon) changes everything … this release ties it all together. (Links at the end)

23

1K

160

1K

193K

YupeiDu retweeted

Zachary Horvitz

@zachary_horvitz

7 months ago

✨Masked Diffusion Language Models✨ are great for reasoning, but not just for the reasons you think! Fast parallel decoding? 🤔 Any-order decoding? 🤨 Plot twist: MDLMs offer A LOT MORE for inference and post-training! 🎢🧵

zachary_horvitz's tweet photo. ✨Masked Diffusion Language Models✨ are great for reasoning, but not just for the reasons you think!

Fast parallel decoding? 🤔 Any-order decoding? 🤨

Plot twist: MDLMs offer A LOT MORE for inference and post-training! 🎢🧵 https://t.co/SPwKMDnv8y

4

163

38

99

22K

YupeiDu retweeted

Ronak Malde

@rronak_

6 months ago

My takeaways from Neurips 1. Continual learning. To support this next frontier, we’re going to need new architectures, new reward functions, new data sources, and new revenue models. 2. Neolabs. Frontier research for risky bets is being shared across multiple companies now 3. San Diego has way better weather than SF 😭

27

911

37

521

293K

Yupei Du @YupeiDu

7 months ago

Also come visit another work of ours (poster, Nov 6, 16:30-18:00, Hall C)! https://t.co/OLpSFTCkcI

Yuekun Yao @yuekun_yao

about 1 year ago

Can language models learn implicit reasoning without chain-of-thought? Our new paper shows: Yes, LMs can learn k-hop reasoning; however, it comes at the cost of an exponential increase in training data and linear growth in model depth as k increases. https://t.co/1CGTRJGnGQ

yuekun_yao's tweet photo. Can language models learn implicit reasoning without chain-of-thought?

Our new paper shows: Yes, LMs can learn k-hop reasoning; however, it comes at the cost of an exponential increase in training data and linear growth in model depth as k increases.

https://t.co/1CGTRJGnGQ https://t.co/qPTMSZJcos

1

7

2

866

0

257

Yupei Du @YupeiDu

7 months ago

How do language models memorize noise while reason impressively well? Our #EMNLP2025 (poster, Nov 5, 11:00-12:30, Hall C) paper shows that memorization reuses internal mechanisms of generalization, even when they are not related to each other! https://t.co/LD2OJmZBm5

1

25

6

2K

Yupei Du @YupeiDu

7 months ago

Come talk to me (not limited to the paper!!) at EMNLP! This is done partly during a wonderful visit to @MaiNLPlab. Many thanks to my amazing collaborators: @PMMondorf @eclipto93 @yuekun_yao Robert Litschko, @barbara_plank

1

0

191

Yupei Du

@YupeiDu

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users