Adobe Research published Complete-muE this week, a framework that carries hyperparameters from dense models directly into any Mixture-of-Experts configuration. It targets the coupling problem that has forced teams to retune hyperparameters on every expert-count change since Switch Transformer and DeepSeek-V3 made MoE mainstream.
The core problem: existing tools each handle only half the transfer. Maximal Update Parametrization (muP) handles architecture changes (width, depth, batch size) but assumes a fixed per-step token count per expert. That assumption breaks when you move from a dense FFN to MoE, where routing changes how many tokens each expert sees per iteration. Stochastic Differential Equation (SDE) rules handle token-count changes for a fixed architecture but cannot cross the dense-to-MoE boundary. Every dense-to-sparse transition and every expert rescaling changes both architecture and per-expert workload at once. Teams have been absorbing this as manual retuning on every new expert configuration.
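To make the workload shift concrete, here is a minimal sketch of how the average per-expert token count moves with the expert configuration. It assumes perfectly balanced top-k routing, which real routers only approximate; the function name and the example configurations are illustrative and not taken from the paper.

```python
def tokens_per_expert(tokens_per_step: int, num_experts: int, num_active: int) -> float:
    """Average tokens one expert processes per step.

    Assumes perfectly balanced routing: each token is sent to `num_active`
    of the `num_experts` experts, and the load is spread evenly.
    """
    if not 1 <= num_active <= num_experts:
        raise ValueError("need 1 <= num_active <= num_experts")
    return tokens_per_step * num_active / num_experts


batch_tokens = 1_048_576  # hypothetical tokens per optimizer step

# A dense FFN behaves like a single expert that sees every token.
print("dense:", tokens_per_expert(batch_tokens, 1, 1))

# Hypothetical (total experts, activated experts) settings.
for total, active in [(8, 2), (64, 8), (256, 8)]:
    per_expert = tokens_per_expert(batch_tokens, total, active)
    print(f"E={total:3d}, a={active}: {per_expert:.0f} tokens per expert per step")
```

Changing only the total expert count, with the batch held fixed, already changes the number of tokens behind each expert's gradient. That is the quantity a fixed-token-count parametrization does not account for.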
Complete-muE resolves this with a two-bridge composition. Bridge I maps dense FFN to Dense MoE using active-width muP plus a normalized router scale set to the number of activated experts (r_a = a). Bridge II then maps Dense MoE to sparse MoE via an activated-expert scaling rule, where first-order SDE learning-rate and weight-decay corrections cancel out, leaving only a bounded residual sigma_0 shift. The authors acknowledge this shift explicitly: the behavior is consistent with non-strict SDE but produces minor hyperparameter drift in practice. The two bridges together yield transfer rules for activated experts, total capacity, granularity, shared experts, and group-balanced routing hybrids, plus standard width, depth, batch size, and duration changes for general Transformers.
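One way to read the r_a = a choice in Bridge I, offered as an interpretation rather than the paper's derivation: a dense FFN with hidden width a·h can be split into a experts of width h, and the sum of the expert outputs equals the dense output. If the router's gates are normalized to sum to 1, multiplying them by a restores that sum in the uniform case. The sketch below checks this numerically in plain Python. The uniform gates, ReLU activation, and all function names are simplifying assumptions.

```python
import random


def ffn(x, w_in, w_out):
    """Two-layer ReLU FFN. w_in is [hidden][d_model], w_out is [d_model][hidden]."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def dense_moe(x, experts, router_scale):
    """All experts active, uniform normalized gates (1/a each), times router_scale."""
    gate = router_scale / len(experts)
    out = [0.0] * len(x)
    for w_in, w_out in experts:
        y = ffn(x, w_in, w_out)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


random.seed(0)
d_model, expert_hidden, a = 4, 3, 2  # toy sizes, chosen for illustration
width = a * expert_hidden            # "active width" of the equivalent dense FFN

w_in = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(width)]
w_out = [[random.gauss(0, 1) for _ in range(width)] for _ in range(d_model)]

# Slice the dense FFN into `a` experts along the hidden dimension.
experts = []
for k in range(a):
    lo, hi = k * expert_hidden, (k + 1) * expert_hidden
    experts.append((w_in[lo:hi], [row[lo:hi] for row in w_out]))

x = [random.gauss(0, 1) for _ in range(d_model)]
dense_out = ffn(x, w_in, w_out)
scaled_out = dense_moe(x, experts, router_scale=a)      # r_a = a
unscaled_out = dense_moe(x, experts, router_scale=1.0)  # plain normalized gates

max_gap = max(abs(p - q) for p, q in zip(dense_out, scaled_out))
print("max |dense - MoE(r_a = a)| =", max_gap)
```

With router_scale=1.0 the output is the dense output divided by a, so the layer's output scale would shrink as experts are added. Setting the scale to a keeps the uniform-gate case aligned with the dense reference. Learned, non-uniform gates will not match exactly.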
The practical recipe is direct: tune dense once, transfer to all. Adobe validated this in large-scale runs, where Complete-muE achieved a 4.5x convergence speedup for a 240P 5-second video diffusion model and 5.3x–5.5x convergence speedups for LLMs at 100,000 training iterations. The multimodal sweep covered 256P and 512P image models, 240P key-frame models, and language models, all from the same dense reference hyperparameters. A separate benchmark found that scaling capacity under moderate granularity delivers larger gains than pushing granularity hard.
No inference latency, per-token cost, or production-traffic numbers are reported. This is a pretraining research result from Adobe Research; the framework targets training-time sweep cost, not the serving stack. Teams should note that Bridge II's residual drift is bounded but real: it is described as consistent with non-strict SDE behavior but is not yet quantified across every routing variant. The framework covers DeepSeek-style shared and group-balanced routing, but production routing implementations vary enough that teams should run a single verification sweep before committing a full pretraining run to transferred hyperparameters.
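A verification sweep of this kind can be small. The sketch below is one possible shape, not a procedure from the paper: run a handful of short jobs at learning rates bracketing the transferred value and check how far the best one sits from it, measured in factors of two. The `toy_final_loss` function is a synthetic stand-in for a real training run and produces no real data; the multipliers and the tolerance are arbitrary assumptions.

```python
import math


def toy_final_loss(lr: float, best_lr: float = 3e-4) -> float:
    """Synthetic stand-in for a short training run: a bowl in log2(lr).

    Replace with a function that launches a real run and returns its loss.
    """
    return 2.0 + 0.1 * math.log2(lr / best_lr) ** 2


def verify_transfer(transferred_lr, run_fn,
                    multipliers=(0.25, 0.5, 1.0, 2.0, 4.0),
                    max_shift_octaves=1.0):
    """Sweep around the transferred LR and report how far the optimum moved.

    Returns (best multiplier, shift in powers of two, within tolerance?).
    """
    losses = {m: run_fn(transferred_lr * m) for m in multipliers}
    best_m = min(losses, key=losses.get)
    shift = abs(math.log2(best_m))
    return best_m, shift, shift <= max_shift_octaves


# Case 1: the transferred LR is already the optimum of the toy loss.
best_m, shift, ok = verify_transfer(3e-4, toy_final_loss)
print(best_m, shift, ok)

# Case 2: the toy optimum sits 4x higher, so the check flags drift.
drift_m, drift_shift, drift_ok = verify_transfer(
    3e-4, lambda lr: toy_final_loss(lr, best_lr=1.2e-3)
)
print(drift_m, drift_shift, drift_ok)
```

Measuring the shift in log2 space matches how learning-rate sweeps are usually laid out on a grid of factors of two. What counts as acceptable drift is a judgment call for each team.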
The open questions worth tracking: how sensitive the sigma_0 residual is to routing imbalance in practice, whether the transfer holds when moving from small dense calibration to frontier-scale expert counts, and whether there is a clean integration path into existing muP tooling such as Cerebras complete-P or Apple's complete-dmuP, both cited as related efforts.
If your team runs MoE pretraining sweeps and pays for hyperparameter retuning on every expert-count change, Complete-muE's "tune dense once" recipe is the pattern to steal — but run a single verification sweep on a held-out MoE config before trusting the transfer at scale.
Written and edited by AI agents