Muon, an optimizer designed for large-language-model training, achieves roughly 2× computational efficiency compared to AdamW at compute-optimal scale. This is the headline finding from a new arXiv survey by Aditya Ranganath, published May 9, 2026: "Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers." The survey maps seven optimizer families — classical first-order, adaptive, memory-efficient, second-order and curvature-aware, sign-based and AutoML-discovered, low-rank projection, and matrix-based methods — and argues that single-algorithm benchmarking no longer suffices for infrastructure decisions.

FIG. 02 Muon vs AdamW throughput on Kimi K2 (GB300 NVL72 GPU), measured in TFLOPs/s per GPU. — Kimi K2 training data
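For readers unfamiliar with the optimizer behind these throughput numbers: Muon applies a momentum update to each 2-D weight matrix and then approximately orthogonalizes it with a Newton-Schulz iteration before taking the parameter step. The sketch below follows the openly published reference algorithm but is a simplification of our own: no Nesterov momentum, no shape-dependent learning-rate scaling, and non-matrix parameters (which production runs hand to AdamW) are ignored.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately replace G with the nearest semi-orthogonal matrix (the U V^T
    # of its SVD) via a quintic Newton-Schulz iteration; coefficients follow the
    # publicly available Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(W: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    # One simplified Muon update for a single 2-D weight matrix.
    momentum.mul_(beta).add_(grad)                            # heavy-ball momentum
    W.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)  # orthogonalized step
```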

The survey identifies six axes along which existing optimizer comparisons routinely break down: hyperparameter fairness, scale dependence, wall-clock efficiency, token efficiency, memory overhead, and downstream task evaluation. Most published speedup claims fail on at least one of them. An optimizer that is faster in step-count terms may lose on wall-clock time, or demand far more careful hyperparameter tuning to replicate its results.
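One way to make those axes concrete is to record all six for every candidate run before comparing optimizers. The schema below is our illustration, not something the survey specifies; the field names, units, and example values are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class OptimizerRunReport:
    # Illustrative record covering the survey's six comparison axes.
    optimizer: str                        # e.g. "AdamW", "Muon", "GaLore"
    hp_search_trials: int                 # hyperparameter fairness: tuning budget spent
    model_params_billions: float          # scale dependence: size the result was measured at
    wall_clock_hours: float               # wall-clock efficiency
    tokens_to_target_loss: float          # token efficiency
    optimizer_state_gb: float             # memory overhead
    downstream_scores: dict[str, float] = field(default_factory=dict)  # task -> score

# Hypothetical values, for illustration only.
run = OptimizerRunReport(
    optimizer="Muon",
    hp_search_trials=32,
    model_params_billions=16.0,
    wall_clock_hours=480.0,
    tokens_to_target_loss=2.8e12,
    optimizer_state_gb=28.0,
    downstream_scores={"MMLU": 0.70},
)
```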

Production adoption of Muon is accelerating. MoonshotAI's Moonlight (a 3B/16B-parameter Mixture-of-Experts model trained on 5.7 trillion tokens) advances the performance-per-FLOP Pareto frontier over comparably sized AdamW-trained models. Kimi K2 and GLM-5 were both trained with Muon, and NVIDIA integrated the optimizer into Megatron Core in April 2026, reaching 1,080 TFLOPs/s/GPU on GB300 NVL72 hardware versus 1,051 TFLOPs/s/GPU for AdamW. On the memory-efficient side, GaLore (Gradient Low-Rank Projection) cuts optimizer-state memory by up to 65.5% relative to a BF16 baseline while preserving full-parameter learning. Its 8-bit variant reduces optimizer memory by 82.5% and total training memory by 63.3%, enough to pretrain LLaMA 7B on a single 24 GB RTX 4090 without model parallelism or offloading.
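GaLore's saving comes from where the Adam state lives: gradients of each weight matrix are projected into a low-rank subspace, Adam's moment buffers are kept at that rank, and the update is projected back to the full parameter shape. A minimal single-matrix sketch under our own simplifying assumptions (no periodic projector refresh, no per-layer rank schedule, no weight decay):

```python
import torch

@torch.no_grad()
def galore_projector(G: torch.Tensor, rank: int = 128) -> torch.Tensor:
    # Top left-singular vectors of the current gradient define the subspace;
    # GaLore recomputes this only every few hundred steps, not every step.
    U, _, _ = torch.linalg.svd(G.float(), full_matrices=False)
    return U[:, :rank].to(G.dtype)                       # (m, rank)

@torch.no_grad()
def galore_adam_step(W, G, P, m, v, step, lr=1e-3, betas=(0.9, 0.999),
                     eps=1e-8, scale=0.25):
    # Adam runs on the projected (rank x n) gradient, so the m and v buffers
    # are rank-sized rather than full-parameter-sized.
    R = P.T @ G                                          # project gradient down
    m.mul_(betas[0]).add_(R, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(R, R, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** step)
    v_hat = v / (1 - betas[1] ** step)
    W.add_(P @ (m_hat / (v_hat.sqrt() + eps)), alpha=-lr * scale)  # project update back up
```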

Vanilla LLaMA 7B pretraining under AdamW requires at least 58 GB: 14 GB for parameters, 42 GB for optimizer states and gradients, and 2 GB for activations. Optimizer selection is now a hardware provisioning decision, not a hyperparameter detail. Teams running multi-run continual pretraining pipelines face a further constraint: Muon-pretrained models fine-tuned with AdamW, and vice versa, underperform models that keep a single optimizer throughout. Optimizer continuity across training stages has to be planned from day one.

FIG. 03 Memory requirements for LLaMA 7B pretraining under AdamW: optimizer state and gradients consume 72% of total memory. — LLaMA pretraining analysis
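The 58 GB total and the 72% share shown in FIG. 03 are consistent with holding parameters, gradients, and both Adam moment buffers in 2-byte precision. The arithmetic below is our own check under that assumption, with the 2 GB activation figure taken as quoted rather than recomputed.

```python
params_billions = 7
bytes_per_value = 2                               # assumes BF16 storage throughout

parameters   = params_billions * bytes_per_value  # 14 GB
gradients    = params_billions * bytes_per_value  # 14 GB
adam_moment1 = params_billions * bytes_per_value  # 14 GB
adam_moment2 = params_billions * bytes_per_value  # 14 GB
activations  = 2                                  # GB, as quoted in the article

optimizer_and_grads = gradients + adam_moment1 + adam_moment2   # 42 GB
total = parameters + optimizer_and_grads + activations          # 58 GB
print(total, round(optimizer_and_grads / total, 2))             # 58, 0.72
```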

Second-order methods such as full Gauss-Newton reach equivalent loss in roughly 1/16 as many steps as Muon, but their per-step compute cost remains impractical at scale. The survey positions better curvature approximations as the most tractable frontier. Low-rank projection methods like GaLore and SOAP are converging on the same insight from a memory-first angle, which suggests deeper integration between the geometry-aware and memory-efficient families ahead.
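The tradeoff reduces to a break-even condition: a 1/16 step count only becomes a wall-clock win if per-step cost grows by less than 16×. In the check below, the per-step cost ratio is a hypothetical placeholder, not a figure from the survey.

```python
step_ratio = 1 / 16            # Gauss-Newton steps relative to Muon, per the survey
per_step_cost_ratio = 20.0     # hypothetical: Gauss-Newton per-step compute vs Muon
wall_clock_ratio = step_ratio * per_step_cost_ratio
print(wall_clock_ratio)        # 1.25 > 1.0: slower end to end despite 16x fewer steps
```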

For infrastructure teams validating optimizer choices in 2026: benchmark against your target model size, token budget, and hardware topology. A 2× token efficiency gain on a 1B-parameter model may not replicate at 30B. Measure all six axes before lock-in.
