Hongwu Peng @Hongwu_Peng - Twitter Profile

Pinned Tweet

19 days ago

📷 "Scale the MoE, Reuse the Sweep" — excited to share our new research paper "Complete-muE" from Adobe Research. Complete-muE is a compositional hyperparameter transfer rule that turns a single small-dense FFN sweep into the right learning-rate, weight-decay, and initialization for any large MoE architecture at any training scale — any activated count, total experts, granularity, shared experts, group-balanced routing, width, depth, batch size, and training duration (iterations). At ~6.3B-total / 0.62B-active scale, this one dense-tuned setting delivers ~2.5× convergence speedup on 256P image diffusion, ~4.5× on 240P 5s video, and ~5.3-5.5× on LLM pretraining — all vs. a dense baseline at identical hyperparameters. Works across LLM and diffusion for image and video generation, with zero per-architecture re-tuning. 📷 Paper: https://t.co/hNzPZV0j70 📷 Blog: https://t.co/kgt5cEeEnZ #MoE #LLM #Diffusion #AdobeResearch

1

33

8

27

3K

Hongwu Peng

@Hongwu_Peng

5 days ago

@sun_hanchi Correction: optimal hyperparameters lead to stable pretraining with almost no loss spikes, but the model will still face load imbalance. That load imbalance just doesn't hurt model quality during pretraining, much as we don't use a load-balancing loss for dense models.

0

9

Hongwu Peng

@Hongwu_Peng

19 days ago

Huge thanks to my incredible collaborators — none of this would have been possible without you: Ohi Dibua, Yuanjun Xiong, Yifan Gong, Jianming Zhang, Yan Kang 🙌

0

86

Hongwu_Peng retweeted

Can Jin @CanJin12321

about 1 month ago

Thrilled that our paper “DTop-p MoE” was accepted to #ICML2026! 🚀 Done during my Adobe Research internship, this work makes MoE routing adaptive while keeping pre-training compute controlled. Paper: https://t.co/OAG8bUvMrB #MoE #LLM #EfficientAI

1

0

69

Hongwu Peng

@Hongwu_Peng

4 months ago

@bnjmn_marie I just saw you, "So, we multiply by 2"😉

0

1

0

190

Hongwu_Peng retweeted

Jeff Dean

@JeffDean

7 months ago

An exciting new approach for doing continual learning, using nested optimization for enhancing long context processing.

42

2K

158

728

526K

Hongwu Peng

@Hongwu_Peng

7 months ago

@elonmusk @iScienceLuvr For a fair comparison, the paper should apply text augmentation to AR LLMs for multi-epoch training, not just overfit the AR LLM and claim it doesn't work well.

0

96

Hongwu_Peng retweeted

Tuo Zhao @tourzhao

8 months ago

🚀 NorMuon: Muon + Neuron-wise adaptive learning rates: +21.7% training efficiency vs Adam, +11.3% vs Muon on 1.1B pretrain. 🚀 Distributed Normuon: A highly efficient FSDP2 implementation. Paper 👉 https://t.co/4OuWAdYbzB #LLM #AI #DeepLearning #Optimizer

2

178

20

103

25K

Hongwu_Peng retweeted

elie

@eliebakouch

10 months ago

Wow, pretty cool that they also open sourced a FSDP2 compatible Muon and PolyNorm working with @huggingface kernels!

10

184

25

105

41K

Hongwu_Peng retweeted

Can Jin @CanJin12321

about 1 year ago

🚀 Big News! Our latest preprint is out: 🧠 “Two Heads Are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning” Introducing M1-32B — an LLM fine-tuned for multi-agent collaboration on M500, a dataset of 500 rich reasoning traces. 👇 (1/4)

1

2

3

0

788

Hongwu_Peng retweeted

Brett Adcock

@adcock_brett

over 1 year ago

Adobe announced the addition of new video generation capabilities to its Firefly AI model and Premiere Pro The new Firefly Video Model is now in 'limited public beta' and allows users to generate video from text prompts or images

1

168

4

21

16K

Hongwu_Peng retweeted

Tianyu Gao @gaotianyu1350

over 1 year ago

Very proud to introduce two of our recent long-context works: HELMET (best long-context benchmark imo): https://t.co/xF5MwlJORz ProLong (a cont’d training & SFT recipe + a SoTA 512K 8B model): https://t.co/PmaVyRRa4X Here is a story of how we arrived there

5

197

46

69

56K

Hongwu Peng

@Hongwu_Peng

over 1 year ago

@daibond_alpha 😂

0

523

Hongwu Peng

@Hongwu_Peng

over 1 year ago

@melon_thief @yuntiandeng It's just for research purposes, to understand LLM reasoning behavior 😂

0

1

0

766

Hongwu_Peng retweeted

Philipp Schmid

@_philschmid

over 1 year ago

5 papers you want to read to better understand how @OpenAI o1 might work. Focusing on improving LLM reasoning capabilities for complex tasks via training/RLHF, not prompting. 👀

> Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking (https://t.co/ote01ei0hQ) from Stanford
> Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (https://t.co/ZkGtrzRcV9) from MultiOn/Stanford
> Let's Verify Step by Step (https://t.co/g2iPYVMd6N) from OpenAI
> V-STaR: Training Verifiers for Self-Taught Reasoners (https://t.co/34VLwce0HZ) from Microsoft, Mila
> Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning (https://t.co/BJE5u62B39) from Notre Dame, Tencent

I'm not claiming this is how O1 works, but it helps us better understand it. I'll share summary posts in the coming days. Make sure to follow! 🫡

9

387

75

564

56K

Hongwu Peng

@Hongwu_Peng

almost 2 years ago

@OlamicShelter @tianle_cai Even garbage tokens may help improve reasoning

1

0

97

Hongwu_Peng retweeted

Philipp Schmid

@_philschmid

almost 2 years ago

Mindblowing! 🤯 A 70B open @AIatMeta Llama 3 better than @AnthropicAI Claude 3.5 Sonnet and @OpenAI GPT-4o using Reflection-Tuning! In Reflection-Tuning, the LLM is trained on synthetic, structured data to learn reasoning and self-correction. 👀

In the assistant response, the LLM:
1️⃣ Begins by outputting its reasoning within <thinking> tags.
2️⃣ If the model detects an error in its reasoning, it uses <reflection> tags within the <thinking> section to signal this and attempt to correct itself.
3️⃣ Once satisfied with its reasoning, it provides the final answer within <output> tags.

Model Results:
🏆 89.9% MMLU, 79.7% MATH, 90.1% IFEval > Sonnet 3.5, GPT-4o
🥇 World's top open LLM (as of release) & checked for contamination using LMSys's LLM Decontaminator
🦙 Trained from Llama 3.1 70B Instruct with new special tokens for <thinking>, <reflection>, <output>
🚀 405B model in development, expected to be the best existing model
🤗 Available on @huggingface
🌡️ Generation parameters: temperature 0.7, top_p 0.95
🤔 No success on an 8B scale yet
🐌 Additional <thinking> leads to increases in the output token count and e2e latency
📚 Dataset and training report coming next week

Model: https://t.co/eCvf82rRK1

Big Kudos to @mattshumer_, @csahil28 and @GlaiveAI.

33

854

155

576

97K

Hongwu_Peng retweeted

Matt Shumer

@mattshumer_

almost 2 years ago

I'm excited to announce Reflection 70B, the world’s top open-source model. Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes. 405B coming next week - we expect it to be the best model in the world. Built w/ @GlaiveAI. Read on ⬇️:

520

9K

1K

6K

3M

Hongwu Peng

@Hongwu_Peng

almost 2 years ago

@hbXNov @kazemi_sm @arianTBD @agarwl_ @vqctran Does "fixed budget" account for the finetuning budget?

1

3

0

1

943
