📷 "Scale the MoE, Reuse the Sweep" — excited to share our new research paper "Complete-muE" from Adobe Research.
Complete-muE is a compositional hyperparameter transfer rule that turns a single small-dense FFN sweep into the right learning-rate, weight-decay, and initialization for any large MoE architecture at any training scale — any activated count, total experts, granularity, shared experts, group-balanced routing, width, depth, batch size, and training duration (iterations). At ~6.3B-total / 0.62B-active scale, this one dense-tuned setting delivers ~2.5× convergence speedup on 256P image diffusion, ~4.5× on 240P 5s video, and ~5.3-5.5× on LLM pretraining — all vs. a dense baseline at identical hyperparameters. Works across LLM and diffusion for image and video generation, with zero per-architecture re-tuning.
📷 Paper: https://t.co/hNzPZV0j70
📷 Blog: https://t.co/kgt5cEeEnZ
#MoE #LLM #Diffusion #AdobeResearch
@sun_hanchi Correction: Optimal hyperparameter leads to stable pretraining with almost no loss spike. But it will still face load imbalance. Just the load imbalance doesn't hurt model quality during pretraining: like we don't use load balance loss for dense model.
Huge thanks to my incredible collaborators — none of this would have been possible without you: Ohi Dibua, Yuanjun Xiong, Yifan Gong, Jianming Zhang, Yan Kang 🙌
📷 "Scale the MoE, Reuse the Sweep" — excited to share our new research paper "Complete-muE" from Adobe Research.
Complete-muE is a compositional hyperparameter transfer rule that turns a single small-dense FFN sweep into the right learning-rate, weight-decay, and initialization for any large MoE architecture at any training scale — any activated count, total experts, granularity, shared experts, group-balanced routing, width, depth, batch size, and training duration (iterations). At ~6.3B-total / 0.62B-active scale, this one dense-tuned setting delivers ~2.5× convergence speedup on 256P image diffusion, ~4.5× on 240P 5s video, and ~5.3-5.5× on LLM pretraining — all vs. a dense baseline at identical hyperparameters. Works across LLM and diffusion for image and video generation, with zero per-architecture re-tuning.
📷 Paper: https://t.co/hNzPZV0j70
📷 Blog: https://t.co/kgt5cEeEnZ
#MoE #LLM #Diffusion #AdobeResearch
Thrilled that our paper “DTop-p MoE” was accepted to #ICML2026! 🚀
Done during my Adobe Research internship, this work makes MoE routing adaptive while keeping pre-training compute controlled.
Paper: https://t.co/OAG8bUvMrB
#MoE#LLM#EfficientAI
@elonmusk@iScienceLuvr For fair comparison, the paper should apply text augmentation into AR LLMs for multi-epoch training, not just overfitting the AR LLM and claim it doesn't work well
🚀 Big News!
Our latest preprint is out:
🧠 “Two Heads Are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning”
Introducing M1-32B — an LLM fine-tuned for multi-agent collaboration on M500, a dataset of 500 rich reasoning traces.
👇 (1/4)
Adobe announced the addition of new video generation capabilities to its Firefly AI model and Premiere Pro
The new Firefly Video Model is now in 'limited public beta' and allows users to generate video from text prompts or images
Very proud to introduce two of our recent long-context works:
HELMET (best long-context benchmark imo): https://t.co/xF5MwlJORz
ProLong (a cont’d training & SFT recipe + a SoTA 512K 8B model): https://t.co/PmaVyRRa4X
Here is a story of how we arrived there
5 papers you want to read to understand better how @OpenAI o1 might work. Focusing on Improving LLM reasoning capabilities for complex tasks via training/RLHF, not prompting. 👀
> Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking (https://t.co/ote01ei0hQ) from Stanford
> Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (https://t.co/ZkGtrzRcV9) from MultiOn/Stanford
> Let's Verify Step by Step (https://t.co/g2iPYVMd6N) from OpenAI
> V-STaR: Training Verifiers for Self-Taught Reasoners (https://t.co/34VLwce0HZ**) from Microsoft, Mila**
> Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning (https://t.co/BJE5u62B39) from Notre Dam, Tencent
I'm not claiming this is how O1 works, but it helps us better understand it. I'll share summary posts in the coming days. Make sure to follow! 🫡
Mindblowing! 🤯 A 70B open @AIatMeta Llama 3 better than @AnthropicAI Claude 3.5 Sonnet and @OpenAI GPT-4o using Reflection-Tuning! In Reflection Tuning, the LLM is trained on synthetic, structured data to learn reasoning and self-correction. 👀
In the assistant response, the LLM:
1️⃣ Begins by outputting its reasoning within <thinking> tags.
2️⃣ If the model detects an error in its reasoning, it uses <reflection> tags within the <thinking> section to signal this and attempt to correct itself.
3️⃣ Once satisfied with its reasoning, it provides the final answer within <output> tags.
Model Results:
🏆 89.9% MMLU, 79.7% MATH, 90.1% IFEval > Sonnet 3.5, GPT-4o
🥇World's top open LLM (as of release) & checked for contamination using LMSys's LLM Decontaminator
🦙 Trained from Llama 3.1 70B Instruct with new special tokens for <thinking>, <reflection>, <output>
🚀 405B model in development, expected to be the best existing model
🤗 Available on @huggingface
🌡️ Generation parameter temperature 0.7, top_p 0.95
🤔 No, success on an 8B scale yet
🐌 Additional <thinking> leads to increases in the output token count and e2e latency
📚 Dataset and training report coming next week
Model: https://t.co/eCvf82rRK1
Big Kudos to @mattshumer_, @csahil28 and @GlaiveAI.
I'm excited to announce Reflection 70B, the world’s top open-source model.
Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.
405B coming next week - we expect it to be the best model in the world.
Built w/ @GlaiveAI.
Read on ⬇️: