We introduce MoUE.
This new MoE paradigm boosts base-model performance by up to 1.3 points on average when training from scratch and by up to 4.2 points on average in checkpoint conversion / CPT, without increasing either activated parameters or total parameters.
The main idea is simple:
a sufficiently wide MoE layer with recursive reuse can be treated as a strict generalization of standard MoE.
https://t.co/UTagOXUD0y
https://t.co/LsnL5GEIaX
#MoE #LLM #MixtureOfExperts #SparseModels #ScalingLaws #Modularity #UniversalTransformers #RecursiveComputation #ContinualPretraining
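To make the "strict generalization" claim concrete, here is a minimal toy sketch, not the MoUE implementation. It assumes a single shared expert pool, scalar "experts" standing in for FFNs, and a hypothetical top-1 `threshold_router`; all names are illustrative. If the router on the shared pool only ever picks a layer's own experts, the computation matches standard MoE exactly. If it is free to pick any expert, the same parameters can be reused across layers.

```python
from typing import Callable, List, Tuple

Expert = Callable[[float], float]
Router = Callable[[int, float, List[int]], int]


def make_expert(scale: float) -> Expert:
    """Toy expert: a scalar map standing in for an expert FFN."""
    return lambda x: scale * x


def forward(x: float, pool: List[Expert], visible: List[List[int]],
            route: Router) -> Tuple[float, List[int]]:
    """Run one token through the layers; visible[l] lists the expert ids layer l may use."""
    path = []
    for layer, allowed in enumerate(visible):
        expert_id = route(layer, x, allowed)
        x = x + pool[expert_id](x)  # residual update with the chosen expert
        path.append(expert_id)
    return x, path


def threshold_router(layer: int, x: float, allowed: List[int]) -> int:
    """Hypothetical top-1 router: first allowed expert for small inputs, last otherwise."""
    return allowed[0] if x < 1.0 else allowed[-1]


# Same four experts (same total parameters) in both configurations.
pool = [make_expert(s) for s in (0.5, -0.25, 0.1, 2.0)]
standard = [[0, 1], [2, 3]]                # standard MoE: each layer owns its experts
universal = [[0, 1, 2, 3], [0, 1, 2, 3]]   # reuse: every layer sees the whole pool


def masked_router(layer: int, x: float, allowed: List[int]) -> int:
    """Router on the shared pool that only ever picks the layer's own experts."""
    local = [e for e in allowed if e in standard[layer]]
    return threshold_router(layer, x, local)


y_std, path_std = forward(0.5, pool, standard, threshold_router)
y_masked, path_masked = forward(0.5, pool, universal, masked_router)
y_reuse, path_reuse = forward(0.5, pool, universal, threshold_router)

print(path_std, path_masked, path_reuse)  # [0, 2] [0, 2] [0, 0]
```

The masked run reproduces the standard run, while the unmasked run applies expert 0 in both layers, which a per-layer expert set cannot express.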
@YouJiacheng @classiclarryd We can quickly verify it on nano, but I don't see a particularly big gap between this article and other byte tokenizers, for example where the 20x gain would come from.
@karpathy Used your repo for my latest experiments, super cool stuff! I did notice that a lot of the gains come from hyperparameter tweaks and existing methods, though. Any ideas on how to take it a step further into some really original territory?
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.
This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This has been the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism.
https://t.co/WAz8aIztKT
All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.
And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
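The propose / evaluate / keep-if-better loop described above can be sketched as a greedy search on a cheap proxy, followed by a re-check at a larger scale. This is only an illustration, not the actual autoresearch code: `toy_val_loss`, `propose`, and the hyperparameter names are hypothetical stand-ins, and the "training run" is a closed-form toy function.

```python
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, float]


def toy_val_loss(cfg: Config, depth: int) -> float:
    """Hypothetical stand-in for 'train a depth-N model and report validation loss'."""
    base = 4.0 / depth ** 0.5
    penalty = (cfg["lr"] - 0.02) ** 2 * 100 + (cfg["weight_decay"] - 0.1) ** 2 * 10
    return base + penalty


def propose(cfg: Config, rng: random.Random) -> Config:
    """Propose one change: rescale a single randomly chosen hyperparameter."""
    key = rng.choice(sorted(cfg))
    candidate = dict(cfg)
    candidate[key] = cfg[key] * rng.choice([0.5, 0.8, 1.25, 2.0])
    return candidate


def autoresearch(cfg: Config, evaluate: Callable[[Config, int], float],
                 small_depth: int, steps: int, seed: int = 0) -> Tuple[Config, float, int]:
    """Greedy loop: keep a change only if it lowers the proxy (small-model) loss."""
    rng = random.Random(seed)
    best = evaluate(cfg, small_depth)
    accepted = 0
    for _ in range(steps):
        candidate = propose(cfg, rng)
        loss = evaluate(candidate, small_depth)
        if loss < best:
            cfg, best = candidate, loss
            accepted += 1
    return cfg, best, accepted


start = {"lr": 0.08, "weight_decay": 0.4}
tuned, small_loss, accepted = autoresearch(start, toy_val_loss, small_depth=12, steps=200)

# "Promotion": re-check the stacked changes at a larger depth before trusting them.
large_before = toy_val_loss(start, 24)
large_after = toy_val_loss(tuned, 24)
print(accepted, small_loss, large_before, large_after)
```

In this toy the penalty does not depend on depth, so transfer to the larger model is guaranteed by construction. In a real setting that transfer is exactly what the promotion step has to test.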
Yeah, totally — I think this is a very meaningful point. And honestly, it’s super valuable for me to get feedback from someone with real large-scale training experience. I’ve been thinking about these issues a lot too.
I think you’re right here: if we try to share across the whole model, at least for now it actually seems worse than more local / smaller-scale sharing, just because the systems complexity gets too high.
Really appreciate the suggestion overall. And please feel free to keep the ideas coming — I’d love to hear more.
One more thought: if sharing is on the table, infra may get a new optimization lever. On an existing cluster, better PP / expert placement is not only a throughput issue — it can also shape expert locality, and possibly how much specialization MoUE can extract in practice.
So in some cases, colocating experts may help quality too, not just systems efficiency.
@selfattentive @teortaxesTex Our intuition is that PP and experts operate on different axes. PP slices the model by depth, while experts slice it by function.
As long as a PP rank contains multiple experts, routing can still produce specialization locally and you still get benefits from parameter sharing.
Also, increasing expert count often means increasing node count, which has real cost both in training and inference. So another angle is: instead of constantly adding parameters and hardware, can we organize the existing parameters better so they specialize more effectively?
If that works, you can get algorithmic gains without paying the full economic cost of scaling out.
@selfattentive @teortaxesTex Another lever here is experts per node. If expert capacity is useful, you don’t necessarily have to scale EP across more nodes.
You can also increase the number of experts within a node, which keeps routing more local and reduces cross-node traffic.
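A rough way to see the locality effect is to count how many token dispatches leave their home node for a fixed set of routing decisions. The sketch below is a toy under loud assumptions: experts are placed contiguously (`expert_id // experts_per_node`), tokens are spread round-robin across nodes, and `cross_node_fraction` is a hypothetical helper, not part of any real EP framework.

```python
from typing import List


def cross_node_fraction(routes: List[int], num_experts: int, experts_per_node: int) -> float:
    """Fraction of token -> expert dispatches that cross a node boundary."""
    if num_experts % experts_per_node != 0:
        raise ValueError("num_experts must be divisible by experts_per_node")
    num_nodes = num_experts // experts_per_node
    crossing = 0
    for token_idx, expert_id in enumerate(routes):
        home_node = token_idx % num_nodes            # assumption: round-robin token placement
        expert_node = expert_id // experts_per_node  # assumption: contiguous expert placement
        if home_node != expert_node:
            crossing += 1
    return crossing / len(routes)


# Eight tokens, each routed to a different one of eight experts.
routes = list(range(8))
for per_node in (2, 4, 8):
    print(per_node, cross_node_fraction(routes, 8, per_node))
# 2 0.75
# 4 0.5
# 8 0.0
```

With the same routing decisions, packing more experts per node lowers the share of dispatches that need cross-node communication in this toy layout.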
@selfattentive @teortaxesTex We may also want to re-search the optimal tradeoff among depth, width, and expert count. Since some capacity can be covered by shared experts, we might allocate parameters to more important parts of the model instead, such as hidden dimension.
@strong_signal1 Yes, that’s a feasible approach for downstream systems built on open-source models.
For model developers though it might work even better, since they have access to the original pretraining data and can maintain a more consistent expert distribution.
@teortaxesTex @kalomaze Great intuition! Stronger routers can actually benefit from these harder load-balancing regimes. We’ve also tried a few small router improvements (Universal Router, and I'll try yours). At this point standard LBL is mostly solved, so maybe it’s time to focus on the harder cases. 😂
@rudzinskimaciej @teortaxesTex Yes! I have tried this idea, and it has some effect, but the improvement is not particularly significant. The challenge lies in achieving balanced training while ensuring good utilization of both local and shared experts.
@rudzinskimaciej @teortaxesTex What you said makes a lot of sense! Since this topology allows countless combinations, there is a lot of work to be done on shrinking the search space while still achieving good results.
Thank you for your interest! CPT does indeed yield excellent MoUE results. In our experiments, we achieved good results even without complex design or hyperparameter search (for example, the universal experts were simply selected at random). However, CPT requires a very fine-grained warmup to ensure that the routing does not crash.
MoEUT is a representative work combining MoE and UT, and I really like it! However, it mainly focuses on model-level recursion, which is different from the layer-level reuse problem that MoUE addresses. I will add a discussion section to the paper. Thank you for your suggestion!
The result is a useful scaling trade:
instead of buying capacity mainly with more activated compute or more stored parameters,
we can trade **algorithmic structure** for capacity by increasing global reusable experts and their recursive compositions.
In practice:
- up to +1.3 avg from scratch with no increase in activated params or total params
- ~+2.5 in depth expansion
- up to +4.2 avg in checkpoint conversion / CPT
Our bet is that MoE may scale not only by adding more experts,
but by making experts more reusable, modular, and globally composable.
That is the direction behind MoUE.
The UELB point is central.
Under reuse, load balancing should not just be layer-local.
It should reflect the computation graph.
That gives a new depth-wise / topology-aware view of load balancing:
balance experts relative to where they can be used, not how often they appear globally.
This is a different optimization problem from standard MoE.
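A small toy illustrates the difference between the two views. This is not the UELB objective from the paper, only a sketch under an assumed rule: each expert's target load is the sum, over the layers where it is reachable, of that layer's tokens divided by the number of experts visible there. The names `global_load` and `reachability_ratio` are hypothetical.

```python
from typing import Dict, List


def global_load(counts: List[Dict[int, int]]) -> Dict[int, float]:
    """Layer-agnostic view: each expert's share of all routed tokens."""
    total = sum(sum(layer.values()) for layer in counts)
    load: Dict[int, float] = {}
    for layer in counts:
        for expert_id, n in layer.items():
            load[expert_id] = load.get(expert_id, 0.0) + n / total
    return load


def reachability_ratio(counts: List[Dict[int, int]],
                       visible: List[List[int]]) -> Dict[int, float]:
    """Observed load divided by a target that depends on where each expert is reachable."""
    observed: Dict[int, float] = {}
    target: Dict[int, float] = {}
    for layer_counts, allowed in zip(counts, visible):
        layer_tokens = sum(layer_counts.values())
        for expert_id in allowed:
            observed[expert_id] = observed.get(expert_id, 0.0) + layer_counts.get(expert_id, 0)
            # Assumed target: an even split of this layer's tokens among visible experts.
            target[expert_id] = target.get(expert_id, 0.0) + layer_tokens / len(allowed)
    return {e: observed[e] / target[e] for e in target}


# Expert 0 is shared by both layers; experts 1 and 2 are local to one layer each.
visible = [[0, 1], [0, 2]]
counts = [{0: 50, 1: 50}, {0: 50, 2: 50}]

print(global_load(counts))                  # {0: 0.5, 1: 0.25, 2: 0.25}
print(reachability_ratio(counts, visible))  # {0: 1.0, 1: 1.0, 2: 1.0}
```

Judged against a global uniform target of 1/3, expert 0 looks overloaded. Judged against where it can actually be used, every expert sits exactly at its target.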