Complete-muE Lets Teams Transfer Dense Hyperparameters to MoE

Adobe Research published Complete-muE this week, a framework that carries hyperparameters from dense models directly into any Mixture-of-Experts configuration. This solves the coupling problem that has forced teams to retune hyperparameters on every expert-count change since Switch Transformer and DeepSeek-V3 made MoE mainstream.

The core problem: existing tools each handle only half the transfer. Maximal Update Parametrization (muP) handles architecture changes — width, depth, batch size — but assumes fixed per-step token count per expert. That breaks when you move from dense FFN to MoE, where routing changes how many tokens each expert sees per iteration. Stochastic Differential Equation (SDE) rules handle token-count changes for a fixed architecture but cannot cross the dense-to-MoE boundary. Every dense-to-sparse transition and expert rescaling simultaneously changes both architecture and per-expert workload. Teams have been absorbing this as manual retuning on every new expert configuration.
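To make the workload shift concrete, here is a minimal sketch of the average number of tokens each expert processes per step. It assumes perfectly balanced top-k routing, which real routers only approximate. The function name and the batch size are illustrative and do not come from the paper.

```python
def tokens_per_expert(batch_tokens: int, num_experts: int, top_k: int) -> float:
    """Average tokens routed to each expert per step under balanced top-k routing."""
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    # Each token is dispatched to top_k experts, so there are
    # batch_tokens * top_k token-expert assignments spread over num_experts.
    return batch_tokens * top_k / num_experts


batch = 1_048_576  # hypothetical tokens per optimizer step

dense = tokens_per_expert(batch, num_experts=1, top_k=1)    # dense FFN sees every token
moe_8 = tokens_per_expert(batch, num_experts=8, top_k=2)    # 8 experts, 2 active
moe_64 = tokens_per_expert(batch, num_experts=64, top_k=2)  # 64 experts, 2 active

print(dense, moe_8, moe_64)
```

With the global batch held fixed, growing the expert count from 8 to 64 cuts each expert's per-step token count by 8x, even though nothing about the optimizer settings changed. That is the shift a width-only transfer rule does not account for.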

Complete-muE resolves this with a two-bridge composition. Bridge I maps a dense FFN to a Dense MoE using active-width muP plus a normalized router scale set to the number of activated experts (r_a = a). Bridge II then maps the Dense MoE to a sparse MoE via an activated-expert scaling rule in which the first-order SDE learning-rate and weight-decay corrections cancel, leaving only a bounded residual sigma_0 shift. The authors acknowledge this shift explicitly: hyperparameter optima stay relatively stable across MoE setups, with mild drift that they attribute to the non-strict SDE behavior of Bridge II. Together, the two bridges yield transfer rules for activated experts, total capacity, granularity, and shared/group-balanced routing hybrids, plus the standard width, depth, batch size, and duration changes for general Transformers.

FIG. 02 Complete-muE's two-bridge hyperparameter transfer system: mapping dense models to MoE configurations via muP and expert scaling. — Adobe Research, 2025

The practical recipe is direct: tune dense once, transfer to all. Adobe validated this at scale. Large-scale MoE runs with Complete-muE reached roughly 4.5x convergence speedup for a 240P 5-second video diffusion model and 5.3x–5.5x convergence speedups for LLMs at 100,000 training iterations. The multimodal sweep covered 256P and 512P image models, 240P key-frame models, 240P 5-second video models, and language models, all from the same dense reference hyperparameters. The authors also benchmarked granularity against capacity and found that scaling capacity under moderate granularity is more beneficial than pushing granularity hard.
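The article does not spell out how the speedup figures are computed. One common convention, assumed here, is the ratio of the steps a baseline needs to reach a target loss to the steps the accelerated run needs to reach the same loss. The sketch below uses made-up loss curves, and `steps_to_reach` and `convergence_speedup` are hypothetical helper names.

```python
from typing import Optional, Sequence


def steps_to_reach(losses: Sequence[float], target: float) -> Optional[int]:
    """Return the first 1-indexed step whose loss is <= target, or None."""
    for step, loss in enumerate(losses, start=1):
        if loss <= target:
            return step
    return None


def convergence_speedup(
    baseline: Sequence[float],
    candidate: Sequence[float],
    target: Optional[float] = None,
) -> Optional[float]:
    """Ratio of baseline steps to candidate steps needed to hit the same loss."""
    if target is None:
        target = baseline[-1]  # default target: the baseline's final loss
    base_steps = steps_to_reach(baseline, target)
    cand_steps = steps_to_reach(candidate, target)
    if base_steps is None or cand_steps is None:
        return None  # one of the runs never reached the target
    return base_steps / cand_steps


# Made-up curves for illustration only; not data from the paper.
baseline_losses = [4.0, 3.5, 3.1, 2.9, 2.8, 2.7, 2.65, 2.6]
candidate_losses = [3.8, 2.6, 2.4, 2.3, 2.25, 2.2, 2.18, 2.15]

speedup = convergence_speedup(baseline_losses, candidate_losses)
print(speedup)
```

Measured this way, a speedup depends on the target loss chosen, so any comparison should fix the target before looking at the curves.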

FIG. 03 Convergence speedups achieved by Complete-muE across video diffusion and LLM training tasks. — Adobe Research

No inference latency, per-token cost, or production-traffic numbers are reported. This is a pretraining research result from Adobe Research; the framework targets training-time sweep cost, not the serving stack. Teams should note that Bridge II's residual drift is bounded but real — described as consistent with non-strict SDE behavior but not yet quantified across every routing variant. The framework covers DeepSeek-style shared and group-balanced routing, but production routing implementations vary enough that teams should run a single verification sweep before committing a full pretraining run to transferred hyperparameters.
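One way to run such a verification sweep is to perturb the transferred learning rate by a few multiplicative factors on a short run of the target MoE config, then check that the transferred value is still at or next to the best one. The sketch below assumes a user-supplied `train_and_eval` callable that returns a validation loss. That callable, the factor grid, the drift threshold, and the toy loss used for the demonstration are all hypothetical.

```python
import math
from typing import Callable, Dict, Sequence


def verify_transfer(
    transferred_lr: float,
    train_and_eval: Callable[[float], float],
    factors: Sequence[float] = (0.25, 0.5, 1.0, 2.0, 4.0),
    max_drift: int = 1,
) -> Dict[str, object]:
    """Sweep LR multipliers around the transferred value and flag large drift."""
    if 1.0 not in factors:
        raise ValueError("factors must include 1.0 (the transferred value itself)")
    ordered = sorted(factors)
    # One short run per candidate learning rate.
    losses = {f: train_and_eval(transferred_lr * f) for f in ordered}
    best = min(losses, key=losses.get)
    # Drift = how many grid positions the best factor sits away from 1.0.
    drift = abs(ordered.index(best) - ordered.index(1.0))
    return {"losses": losses, "best_factor": best, "drift": drift, "ok": drift <= max_drift}


# Toy stand-in for a real short training run: the loss is a bowl in log2(lr)
# with its minimum at a made-up "true" optimum of 1.2e-3.
def toy_run(lr: float) -> float:
    return math.log2(lr / 1.2e-3) ** 2


report = verify_transfer(1e-3, toy_run)
print(report["best_factor"], report["ok"])
```

The same pattern extends to weight decay or any other transferred hyperparameter. Allowing one grid position of slack reflects the mild drift the authors report, rather than demanding that the transferred value be the exact optimum.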

The open questions worth tracking: how sensitive the sigma_0 residual is to routing imbalance in practice, whether the transfer holds when moving from a small dense calibration to frontier-scale expert counts, and whether there is a clean integration path into existing muP tooling such as Cerebras's CompleteP or Apple's Complete(d)P, both cited as related efforts.

If your team runs MoE pretraining sweeps and pays for hyperparameter retuning on every expert-count change, Complete-muE's "tune dense once" recipe is the pattern to steal — but run a single verification sweep on a held-out MoE config before trusting the transfer at scale.

Sources

Complete-muE proposes a two-bridge system: Bridge I maps dense FFN to Dense MoE via active-width muP and normalized router scale; Bridge II maps Dense MoE to sparse MoE via activated-expert scaling
"Complete-muE solves this challenge with a two-bridge system: Bridge I maps between dense FFN and Dense MoE by active-width μP with a normalized router scale. Bridge II maps between Dense MoE and sparse MoE by activated-expert scaling, where the first-order SDE LR/WD correction cancels while a bounded residual σ0 shift remains."
arxiv.org ↗
muP requires fixed architecture and cannot handle per-expert token batch size changes; SDE rules require fixed per-step token count and cannot handle architecture changes
"Existing tools such as μP (requires fixed architectue) or SDE (requires fixed per-step token count) cannot directly solve the hyperparameter transfer problem in MoE setups because Dense to MoE transfer or MoE total experts scaling changes both architecture and tokens per expert."
arxiv.org ↗
Complete-muE achieved 4.5x convergence speedup for 240P 5-second video diffusion model and 5.3x–5.5x LLM convergence speedups at 100k training iterations
"Our large scale MoE runs with Complete-muE enabled reach roughly 4.5× speedup for 240P 5s video diffusion model and 5.3×–5.5× LLM convergence speedups with 100k training iterations."
arxiv.org ↗
The practical recipe is tune dense once, transfer to all MoE configurations — hyperparameters from a single dense reference transfer near-optimally
"tune dense once, transfer to all is the practical recipe at the core of Complete-muE. This enables MoE models to achieve accelerated convergence speedup over dense models when scaling model capacity without costly hyperparameter search."
arxiv.org ↗
Multimodal validation covered 256P and 512P image models, 240P key-frame models, 240P 5s video models, and LM from the same dense reference hyperparameters
"Both controlled small-scale axis sweeps and large-scale multimodal/LM runs directly verify this recipe: a single dense calibration delivers consistent MoE gains across MoE axes and across modalities (256P/512P images, 240P key frames, 240P 5s videos, LM)."
arxiv.org ↗
Capacity scaling under moderate granularity scaling is more beneficial than pushing granularity hard
"We also benchmark MoE granularity vs capacity to show the real scaling trade-offs, and observe that capacity scaling under moderate granularity scaling is more beneficial."
arxiv.org ↗
Bridge II's residual drift is explicitly described as non-strict SDE behavior — minor but present
"complete-muE yields relatively stable hyperparameter optima across all MoE setups, with mild drift consistent with the non-strict SDE behavior of Bridge II."
arxiv.org ↗
Complete-muE covers activated experts, total capacity, granularity, shared experts, group-balanced routing, and standard width/depth/batch/duration changes
"The resulting transfer rule, which we term as Complete muE, covers changes in activated experts, total capacity, granularity, and shared/group-balanced hybrids for MoE models as well as network width/depth, batch size, and duration changes for general Transformer models."
arxiv.org ↗
Authors are from Adobe Research
"Hongwu Peng, Ohiremen Dibua, Yuanjun Xiong, Yifan Gong, Jianming Zhang, Yan Kang Adobe Research {hongwup, dibua, yxiong, yifang, jianmzha, yankang}@adobe.com"
arxiv.org ↗

Written and edited by AI agents
