Chenwei Cui

@ccui42

CS PhD Student @ Kerner Lab @hannah_kerner @SCAI_ASU. I am interested in the science of machine learning.

Tempe, AZ

Joined March 2023

148 Following

275 Followers

5.5K Posts

Pinned Tweet

Chenwei Cui @ccui42

4 months ago

Introducing Multi-Head LatentMoE 🚀 Turns out, making NVIDIA's LatentMoE [1] multi-head further unlocks O(1), balanced, and deterministic communication. Our insight: Head Parallel; Move routing from before all-to-all to after. Token duplication happens locally. Always uniform, always deterministic. It works orthogonally to EP as a new dimension of parallelism. For example, use HP for intra-cluster all-to-all as a highway, then use EP locally. We propose FlashAttention-like routing and expert computation, both exact, IO-aware, and constant memory. This is to handle the increased number of sub-tokens. Results: - We replicate LatentMoE and confirm it is indeed faster than MoE, with matching model performance. (See Design Principle IV in [1]) - Up to 1.61x faster training than MoE+EP with identical model performance. - Higher model performance while still 1.11x faster with doubled granularity. 📄 Paper: https://t.co/re5ludi0mB 💻 Code: https://t.co/8pHdtN3Z4i [1] Elango et al., "LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts", 2026. https://t.co/cNmJ8tchTF

ccui42's tweet photo. Introducing Multi-Head LatentMoE 🚀

Turns out, making NVIDIA's LatentMoE [1] multi-head further unlocks O(1), balanced, and deterministic communication.

Our insight: Head Parallel; Move routing from before all-to-all to after. Token duplication happens locally. Always uniform, always deterministic.

It works orthogonally to EP as a new dimension of parallelism. For example, use HP for intra-cluster all-to-all as a highway, then use EP locally.

We propose FlashAttention-like routing and expert computation, both exact, IO-aware, and constant memory. This is to handle the increased number of sub-tokens.

Results:
- We replicate LatentMoE and confirm it is indeed faster than MoE, with matching model performance. (See Design Principle IV in [1])
- Up to 1.61x faster training than MoE+EP with identical model performance.
- Higher model performance while still 1.11x faster with doubled granularity.

📄 Paper: https://t.co/re5ludi0mB
💻 Code: https://t.co/8pHdtN3Z4i

[1] Elango et al., "LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts", 2026. https://t.co/cNmJ8tchTF

560

410

96K

ccui42 retweeted

Rosinality @rosinality

about 21 hours ago

https://t.co/Lhvj3dimz9 Training an RNN in parallel by RNN cell to predict a compressed state that predicts future outputs given past inputs. But as this state is distilled from time-parallel models its structure could be similar to that.

rosinality's tweet photo. https://t.co/Lhvj3dimz9

Training an RNN in parallel by RNN cell to predict a compressed state that predicts future outputs given past inputs. But as this state is distilled from time-parallel models its structure could be similar to that. https://t.co/Qn2WMA1Hoo

ccui42 retweeted

Nathan Lambert

@natolambert

1 day ago

We have another 65 page frontier model report from Nvidia to read @eliebakouch @stochasticchasm and gang

671

378

50K

ccui42 retweeted

Joël Niklaus

@joelniklaus

7 days ago

Synthetic Data Playbook update! TLDR: Most diversity scores don't predict synthetic-data quality. G-Vendi is the exception. I added four diversity metrics across 83 synthetic-pretraining experiments: Vendi Score, mean intra-set cosine similarity, and near-duplicate rate (all on bge-base-en-v1.5 sentence embeddings), plus G-Vendi from the Prismatic Synthesis paper. The selling point if any of them works: rank a candidate rephrasing recipe from 1000 generated samples, instead of generating millions of tokens and training a student model end-to-end. The three semantic metrics fall apart on aggregate downstream performance. Vendi correlates negatively (ρ = −0.21), mean cosine similarity positively (ρ = +0.23, but that's just −Vendi by construction), near-duplicate rate essentially zero. None of them is a usable overall predictor. G-Vendi flips the script. Instead of asking "do these samples look different?", it asks "would training on each of them push the model's weights in different directions?". You backprop the cross-entropy loss through a frozen proxy LLM (I used Qwen3-0.6B), sketch the resulting gradient, L2-normalise and then take the Vendi Score in that feature space. Across the same 83 experiments, G-Vendi correlates positively with the macro score (ρ = +0.26) and is the single best predictor for reasoning of any metric I tested (ρ = +0.38), beating every quality classifier and semantic-diversity variant. But none of this displaces DCLM-difference. Even G-Vendi explains only ~7% of variance on the macro score, while DCLM-difference (output minus input quality score) still wins overall (ρ = +0.61). Diversity metrics complement quality classifiers, they don't replace them. The shortcut around training-and-evaluating still doesn't exist.

joelniklaus's tweet photo. Synthetic Data Playbook update!

TLDR: Most diversity scores don't predict synthetic-data quality. G-Vendi is the exception.

I added four diversity metrics across 83 synthetic-pretraining experiments: Vendi Score, mean intra-set cosine similarity, and near-duplicate rate (all on bge-base-en-v1.5 sentence embeddings), plus G-Vendi from the Prismatic Synthesis paper. The selling point if any of them works: rank a candidate rephrasing recipe from 1000 generated samples, instead of generating millions of tokens and training a student model end-to-end.

The three semantic metrics fall apart on aggregate downstream performance. Vendi correlates negatively (ρ = −0.21), mean cosine similarity positively (ρ = +0.23, but that's just −Vendi by construction), near-duplicate rate essentially zero. None of them is a usable overall predictor.

G-Vendi flips the script. Instead of asking "do these samples look different?", it asks "would training on each of them push the model's weights in different directions?". You backprop the cross-entropy loss through a frozen proxy LLM (I used Qwen3-0.6B), sketch the resulting gradient, L2-normalise and then take the Vendi Score in that feature space. Across the same 83 experiments, G-Vendi correlates positively with the macro score (ρ = +0.26) and is the single best predictor for reasoning of any metric I tested (ρ = +0.38), beating every quality classifier and semantic-diversity variant.

But none of this displaces DCLM-difference. Even G-Vendi explains only ~7% of variance on the macro score, while DCLM-difference (output minus input quality score) still wins overall (ρ = +0.61). Diversity metrics complement quality classifiers, they don't replace them. The shortcut around training-and-evaluating still doesn't exist.

ccui42 retweeted

Zephyr

@zephyr_z9

6 days ago

613

378

170K

ccui42 retweeted

ClaudeDevs

@ClaudeDevs

7 days ago

Sometimes system instructions need to be updated during the course of a conversation or session. But editing the top-level system prompt field invalidates the prompt cache for everything that follows.

212

31K

ccui42 retweeted

Alex Albert

@alexalbert__

8 days ago

We put a lot of work into calibrating thinking effort for Opus 4.8. As you're trying out the model, if you do run into any examples of it still over/under thinking, please flag it to us!

431

38K

ccui42 retweeted

David

@DavidSHolz

9 days ago

Most researchers agree that autoregression is best when memory bandwidth is cheap and diffusion is best when FLOPS are cheap. They also admit the future of compute is all FLOPS because memory scaling is hard and scaling FLOPS is easy. So why not go all in on diffusion????

625

234K

ccui42 retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

11 days ago

Language Models Need Sleep "Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache." "increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning."

iScienceLuvr's tweet photo. Language Models Need Sleep

"Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache."

"increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning."

916

146

718

66K

ccui42 retweeted

Tiberiu Mușat

@Tiberiu_Musat_

10 days ago

Why does deep learning generalize? What does weight decay really do? Can algorithmic information theory address these questions? In my latest preprint, I give a proof that the minimum neural weight norm matches the minimum program length (aka Kolmogorov Complexity), up to a logarithmic factor. In other words, the neural network with the smallest possible weight norm (that fits the data) must encode the shortest program (that fits the data). The result only holds for fixed-precision neural nets: infinite precision nets can store infinite information with finite (small) weights. https://t.co/eMZIGQDf2f

Tiberiu_Musat_'s tweet photo. Why does deep learning generalize? What does weight decay really do? Can algorithmic information theory address these questions?

In my latest preprint, I give a proof that the minimum neural weight norm matches the minimum program length (aka Kolmogorov Complexity), up to a logarithmic factor. In other words, the neural network with the smallest possible weight norm (that fits the data) must encode the shortest program (that fits the data).

The result only holds for fixed-precision neural nets: infinite precision nets can store infinite information with finite (small) weights.

https://t.co/eMZIGQDf2f

153

143K

ccui42 retweeted

Alex Volkov

@altryne

11 days ago

Random Q: Anyone know what happened to @magicailabs , who supposedly cracked the "100 Million context windows" and got investments by Nat & Daniel, Nvidia etc? Last blog nearly 2 years ago, no announcements since then, no demo of the ultra long context windows... silence?

altryne's tweet photo. Random Q:

Anyone know what happened to @magicailabs , who supposedly cracked the "100 Million context windows" and got investments by Nat & Daniel, Nvidia etc?

Last blog nearly 2 years ago, no announcements since then, no demo of the ultra long context windows... silence? https://t.co/mkhI6hS6NY

217

53K

ccui42 retweeted

Epoch AI

@EpochAIResearch

10 days ago

Are we nearing a compute crunch? In our latest Gradient Update, @luke__emberson and @Jsevillamol estimate how many tokens all the Blackwell chips on Earth could serve, and compare this to total token demand. Direct comparisons are difficult, but it appears demand is growing much faster than supply.

EpochAIResearch's tweet photo. Are we nearing a compute crunch?

In our latest Gradient Update, @luke__emberson and @Jsevillamol estimate how many tokens all the Blackwell chips on Earth could serve, and compare this to total token demand. Direct comparisons are difficult, but it appears demand is growing much faster than supply.

270

137

29K

ccui42 retweeted

Claude

@claudeai

11 days ago

Six Claude projects that all came from the same question: “why not?”

598

12K

444

ccui42 retweeted

Sebastien Bubeck

@SebastienBubeck

10 days ago

We now know that with an appropriate harness both Mythos and GPT-5.5 can reproduce what our internal model did in one-shot for the unit distance problem. Clearly there is an insane overhang of capabilities with this generation of models, and no ceiling in sight for what scientific advances they can bring. You can go and try to discover new things with 5.5 right now!

274

214K

ccui42 retweeted

Rohan Paul

@rohanpaul_ai

12 days ago

A large MoE model may be wasting half its expert compute on tokens that barely need expert help. In this paper 50% of expert computation removed, with almost no loss in accuracy. This makes already-trained MoE models like Qwen3 and GLM stop calling half their experts when a token is too easy to need them. Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. Shows that many MoE tokens do not need real experts, only permission to skip them. That sounds like a small routing trick, but it changes the economics of deployed language models. Standard MoE models already avoid using every parameter, yet they still spend the same expert budget on every token. ZEDA adds a strange new option to the router: experts that output exactly nothing. When the model routes a token to one of these zero experts, it is not making the model dumber; it is admitting that this token does not need another expensive transformation. The clever part is not the dummy expert, but the adaptation method. Instead of retraining the model from scratch, the original MoE becomes a frozen teacher, while the new dynamic version learns when it can safely skip work. Across Qwen3-30B-A3B and GLM-4.7-Flash, the result is roughly half the expert computation removed, with only marginal average accuracy loss and about 20% real inference speedup. The deeper finding is: compute use did not simply track task difficulty. The model spent more expert budget where uncertainty or teacher-student disagreement rose, while structured code and math fragments often needed less. That makes ZEDA feel less like pruning and more like attention to computational doubt. ---- Paper Link – arxiv. org/abs/2605.18643 Paper Title: "Post-Trained MoE Can Skip Half Experts via Self-Distillation"

rohanpaul_ai's tweet photo. A large MoE model may be wasting half its expert compute on tokens that barely need expert help.

In this paper 50% of expert computation removed, with almost no loss in accuracy.

This makes already-trained MoE models like Qwen3 and GLM stop calling half their experts when a token is too easy to need them.

Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones.

Shows that many MoE tokens do not need real experts, only permission to skip them.

That sounds like a small routing trick, but it changes the economics of deployed language models.

Standard MoE models already avoid using every parameter, yet they still spend the same expert budget on every token.

ZEDA adds a strange new option to the router: experts that output exactly nothing.

When the model routes a token to one of these zero experts, it is not making the model dumber; it is admitting that this token does not need another expensive transformation.

The clever part is not the dummy expert, but the adaptation method.

Instead of retraining the model from scratch, the original MoE becomes a frozen teacher, while the new dynamic version learns when it can safely skip work.

Across Qwen3-30B-A3B and GLM-4.7-Flash, the result is roughly half the expert computation removed, with only marginal average accuracy loss and about 20% real inference speedup.

The deeper finding is: compute use did not simply track task difficulty.

The model spent more expert budget where uncertainty or teacher-student disagreement rose, while structured code and math fragments often needed less.

That makes ZEDA feel less like pruning and more like attention to computational doubt.

----

Paper Link – arxiv. org/abs/2605.18643

Paper Title: "Post-Trained MoE Can Skip Half Experts via Self-Distillation"

322

250

22K

ccui42 retweeted

François Fleuret

@francoisfleuret

12 days ago

IMO a researcher studies a problem that may not be solvable, while an engineer solves a problem that is considered solvable.

200

295

297K

ccui42 retweeted

Percy Liang

@percyliang

12 days ago

Not only do we want to train a good model, we want to know it'll be good before we even start training. About a month ago, the Marin team launched a 129B (16B active) 1e23 FLOPs MoE run and preregistered a loss of 2.252. The run finished this past week and landed at 2.234. https://t.co/OptaVa7jIO

percyliang's tweet photo. Not only do we want to train a good model, we want to know it'll be good before we even start training.

About a month ago, the Marin team launched a 129B (16B active) 1e23 FLOPs MoE run and preregistered a loss of 2.252. The run finished this past week and landed at 2.234.
https://t.co/OptaVa7jIO

617

269

64K

ccui42 retweeted

cozyblaze @cozyblazex

15 days ago

I redid the multi-digit multiplication experiment, now with gpt-5.5. With medium reasoning and 7 samples each cell, it pretty much aced the test with 99.46% accuracy. The model had no tools to call and had to rely on its reasoning. Can it go further? (1/4)

cozyblazex's tweet photo. I redid the multi-digit multiplication experiment, now with gpt-5.5. With medium reasoning and 7 samples each cell, it pretty much aced the test with 99.46% accuracy. The model had no tools to call and had to rely on its reasoning. Can it go further? (1/4) https://t.co/D6943YVWw6

960

338

180K

ccui42 retweeted

Tri Dao

@tri_dao

15 days ago

After some mathematical rewrite, turns out all of transformer is a series of gemm + epilogue. Given a few optimized primitives, LLMs (and novice humans) can write speed-of-light kernels for all transformer ops!

128

945

130K

ccui42 retweeted

MTS @MTSlive

16 days ago

SITUATION ANALYSIS: Alongside the proof of the counterexample, OpenAI released a companion paper in which leading mathematicians from several fields share a human-verified version. But the paper also includes the personal reflections of the mathematicians, and these range from general meditations on the role of AI in moving the field of mathematics forward to more specific thoughts on aspects of the counterexample and its significance. We think these are just as interesting as the headline story, and have highlighted several below. The full paper, linked below, is worth reading. Thomas Bloom (.@thomasfbloom) "On examining the construction, it becomes more clear how people had missed this before – it requires the confluence of several different unlikely events: that a good mathematician is: (1) spending significant time in thinking about the unit distance conjecture in the first place; (2) seriously trying to disprove it, despite the oft-repeated belief of Erdős that it is true; (3) believes that there is mileage in generalising the original construction to other number fields, and so is willing to expend significant time in exploring such constructions; and (4) sufficiently familiar with the relevant parts of class field theory to recognise that the appropriately phrased question about infinite towers of number fields with appropriate parameters can be solved using existing theory." W T Gowers (.@wtgowers) After reflecting on methods for estimating the difficulty of mathematics problems that AI models are able to solve or assist with, and concluding that this particular counterexample was relatively low difficulty: "My current bet is that progress in AI mathematics is *not* about to reach a plateau, and that we will soon see AI solutions to many problems that we will find hard to explain away as easier than expected with hindsight." Melanie Matchett Wood (Professor of Mathematics at Harvard University) "I believe if the level and type of human expertise that is represented on this note had been assembled to find a counterexample to this conjecture a month ago, and those people put in similar amounts of time working on it than they did to reading and thinking about Chat GPT’s solution, the mathematicians would have found a counterexample." "We can all be reminded by this development of how frequently interesting and powerful things happen mathematically when one applies ideas from one field to another, and think about how AI can help us find more cross-field applications."

22K

ccui42 retweeted

Margaret Li @margs_li

19 days ago

MoEs are everywhere, but the design space is confusing: total vs active experts? expert size? shared experts? routing? token dropping? We train >2000 MoE LMs 🫠 to investigate and bring you: 📄🔪🍰 Slicing and Dicing MoEs Tl;dr: it's all about expert size and count [1/9]

margs_li's tweet photo. MoEs are everywhere, but the design space is confusing: total vs active experts? expert size? shared experts? routing? token dropping?

We train >2000 MoE LMs 🫠 to investigate and bring you:

📄🔪🍰 Slicing and Dicing MoEs

Tl;dr: it's all about expert size and count

[1/9] https://t.co/zROgT2TAE3

376

337

36K

Chenwei Cui

@ccui42

Last Seen Users on Sotwe

Trends for you

Most Popular Users