📷 "Scale the MoE, Reuse the Sweep" — excited to share our new research paper "Complete-muE" from Adobe Research.
Complete-muE is a compositional hyperparameter transfer rule that turns a single small-dense FFN sweep into the right learning-rate, weight-decay, and initialization for any large MoE architecture at any training scale — any activated count, total experts, granularity, shared experts, group-balanced routing, width, depth, batch size, and training duration (iterations). At ~6.3B-total / 0.62B-active scale, this one dense-tuned setting delivers ~2.5× convergence speedup on 256P image diffusion, ~4.5× on 240P 5s video, and ~5.3-5.5× on LLM pretraining — all vs. a dense baseline at identical hyperparameters. Works across LLM and diffusion for image and video generation, with zero per-architecture re-tuning.
📷 Paper: https://t.co/hNzPZV0j70
📷 Blog: https://t.co/kgt5cEeEnZ
#MoE #LLM #Diffusion #AdobeResearch
📷 "Scale the MoE, Reuse the Sweep" — excited to share our new research paper "Complete-muE" from Adobe Research.
Complete-muE is a compositional hyperparameter transfer rule that turns a single small-dense FFN sweep into the right learning-rate, weight-decay, and initialization for any large MoE architecture at any training scale — any activated count, total experts, granularity, shared experts, group-balanced routing, width, depth, batch size, and training duration (iterations). At ~6.3B-total / 0.62B-active scale, this one dense-tuned setting delivers ~2.5× convergence speedup on 256P image diffusion, ~4.5× on 240P 5s video, and ~5.3-5.5× on LLM pretraining — all vs. a dense baseline at identical hyperparameters. Works across LLM and diffusion for image and video generation, with zero per-architecture re-tuning.
📷 Paper: https://t.co/hNzPZV0j70
📷 Blog: https://t.co/kgt5cEeEnZ
#MoE #LLM #Diffusion #AdobeResearch
Discrete or continuous tokens? Or even tokenizer-free? The visual modeling debate rages on, but for now, let me introduce L24SQ, a provably optimal, regularizer-free quantizer with a large codebook (~200k), achieving SoTA reconstruction-compression tradeoff and generative power!
Did you ever want to train a scene with one billion gaussians? Here comes a "tutorial":
"RetinaGS: Scalable Training for Dense Scene Rendering with Billion-Scale 3D Gaussians"
https://t.co/mmRLGLj5D1
Amazon VP Stefano Soatto says that three of his team's five @CVPR papers are about making AI more "graceful": one on backward-compatible updates to ML models, one on cross-device compatibility, and one on linearization of nonlinear models. #CVPR2021 https://t.co/rvTBCxYslh