yjxiong @bitxiong - Twitter Profile

yjxiong @bitxiong

13 days ago

if you ever wonder how to set hyperparameters for your huge MoE models, here it is.

Hongwu Peng

@Hongwu_Peng

19 days ago

"Scale the MoE, Reuse the Sweep": excited to share our new research paper "Complete-muE" from Adobe Research. Complete-muE is a compositional hyperparameter transfer rule that turns a single small-dense FFN sweep into the right learning-rate, weight-decay, and initialization for any large MoE architecture at any training scale: any activated count, total experts, granularity, shared experts, group-balanced routing, width, depth, batch size, and training duration (iterations). At ~6.3B-total / 0.62B-active scale, this one dense-tuned setting delivers ~2.5× convergence speedup on 256P image diffusion, ~4.5× on 240P 5s video, and ~5.3-5.5× on LLM pretraining, all vs. a dense baseline at identical hyperparameters. Works across LLM and diffusion for image and video generation, with zero per-architecture re-tuning. Paper: https://t.co/hNzPZV0j70 Blog: https://t.co/kgt5cEeEnZ #MoE #LLM #Diffusion #AdobeResearch


bitxiong retweeted

Yue Zhao

@__yuezhao__

6 months ago

Discrete or continuous tokens? Or even tokenizer-free? The visual modeling debate rages on, but for now, let me introduce L24SQ, a provably optimal, regularizer-free quantizer with a large codebook (~200k), achieving SoTA reconstruction-compression tradeoff and generative power! https://t.co/sN3Xt9cq9y


yjxiong @bitxiong

over 1 year ago

@NeurIPSConf of course. blatant racism is now called cultural generalization in our community. keep on reinventing terms. what a time to be alive.

bitxiong retweeted

scooby snacks @scoobydsnacks

over 1 year ago

@pmddomingos If the word “Chinese” was replaced with the word “Jews”, the professor would be fired.


bitxiong retweeted

MrNeRF

@janusch_patas

almost 2 years ago

Did you ever want to train a scene with one billion gaussians? Here comes a "tutorial": "RetinaGS: Scalable Training for Dense Scene Rendering with Billion-Scale 3D Gaussians" https://t.co/mmRLGLj5D1 https://t.co/8UPID24jSG


bitxiong retweeted

Natural Language Processing Papers @HEI

about 2 years ago

A Full-duplex Speech Dialogue Scheme Based On Large Language Models. https://t.co/HhX9aqtrFy


bitxiong retweeted

Amazon Science

@AmazonScience

about 5 years ago

Amazon VP Stefano Soatto says that three of his team's five @CVPR papers are about making AI more "graceful": one on backward-compatible updates to ML models, one on cross-device compatibility, and one on linearization of nonlinear models. #CVPR2021 https://t.co/rvTBCxYslh


bitxiong retweeted