Muon, an optimizer designed for large-language-model training, achieves roughly 2× computational efficiency compared to AdamW at compute-optimal scale. This is the headline finding from a new arXiv survey by Aditya Ranganath, published May 9, 2026: "Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers." The survey maps seven optimizer families — classical first-order, adaptive, memory-efficient, second-order and curvature-aware, sign-based and AutoML-discovered, low-rank projection, and matrix-based methods — and argues that single-algorithm benchmarking no longer suffices for infrastructure decisions.

FIG. 02 Muon vs AdamW throughput on Kimi K2 (GB300 NVL72 GPU), measured in TFLOPs/s per GPU. — Kimi K2 training data
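For readers unfamiliar with the optimizer behind these throughput numbers: Muon applies a momentum update to each 2-D weight matrix and then approximately orthogonalizes it with a Newton-Schulz iteration before taking the parameter step. The sketch below follows the openly published reference algorithm but is a simplification of our own: no Nesterov momentum, no shape-dependent learning-rate scaling, and non-matrix parameters (which production runs hand to AdamW) are ignored.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately replace G with the nearest semi-orthogonal matrix (the U V^T
    # of its SVD) via a quintic Newton-Schulz iteration; coefficients follow the
    # publicly available Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(W: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    # One simplified Muon update for a single 2-D weight matrix.
    momentum.mul_(beta).add_(grad)                            # heavy-ball momentum
    W.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)  # orthogonalized step
```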

The survey identifies six axes along which existing optimizer comparisons routinely break down: hyperparameter fairness, scale dependence, wall-clock efficiency, token efficiency, memory overhead, and downstream task evaluation. Most published speedup claims fail on at least one of them. An optimizer that is faster in step-count terms may lose on wall-clock time, or demand far more careful hyperparameter tuning to replicate its results.
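One way to make those axes concrete is to record all six for every candidate run before comparing optimizers. The schema below is our illustration, not something the survey specifies; the field names, units, and example values are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class OptimizerRunReport:
    # Illustrative record covering the survey's six comparison axes.
    optimizer: str                        # e.g. "AdamW", "Muon", "GaLore"
    hp_search_trials: int                 # hyperparameter fairness: tuning budget spent
    model_params_billions: float          # scale dependence: size the result was measured at
    wall_clock_hours: float               # wall-clock efficiency
    tokens_to_target_loss: float          # token efficiency
    optimizer_state_gb: float             # memory overhead
    downstream_scores: dict[str, float] = field(default_factory=dict)  # task -> score

# Hypothetical values, for illustration only.
run = OptimizerRunReport(
    optimizer="Muon",
    hp_search_trials=32,
    model_params_billions=16.0,
    wall_clock_hours=480.0,
    tokens_to_target_loss=2.8e12,
    optimizer_state_gb=28.0,
    downstream_scores={"MMLU": 0.70},
)
```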

Production adoption of Muon is accelerating. MoonshotAI's Moonlight (a 3B/16B-parameter Mixture-of-Experts model trained on 5.7 trillion tokens) advances the performance-per-FLOP Pareto frontier over comparably sized AdamW-trained models. Kimi K2 and GLM-5 were both trained with Muon, and NVIDIA integrated the optimizer into Megatron Core in April 2026, reaching 1,080 TFLOPs/s/GPU on GB300 NVL72 hardware versus 1,051 TFLOPs/s/GPU for AdamW. On the memory-efficient side, GaLore (Gradient Low-Rank Projection) cuts optimizer-state memory by up to 65.5% relative to a BF16 baseline while preserving full-parameter learning. Its 8-bit variant reduces optimizer memory by 82.5% and total training memory by 63.3%, enough to pretrain LLaMA 7B on a single 24 GB RTX 4090 without model parallelism or offloading.
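GaLore's saving comes from where the Adam state lives: gradients of each weight matrix are projected into a low-rank subspace, Adam's moment buffers are kept at that rank, and the update is projected back to the full parameter shape. A minimal single-matrix sketch under our own simplifying assumptions (no periodic projector refresh, no per-layer rank schedule, no weight decay):

```python
import torch

@torch.no_grad()
def galore_projector(G: torch.Tensor, rank: int = 128) -> torch.Tensor:
    # Top left-singular vectors of the current gradient define the subspace;
    # GaLore recomputes this only every few hundred steps, not every step.
    U, _, _ = torch.linalg.svd(G.float(), full_matrices=False)
    return U[:, :rank].to(G.dtype)                       # (m, rank)

@torch.no_grad()
def galore_adam_step(W, G, P, m, v, step, lr=1e-3, betas=(0.9, 0.999),
                     eps=1e-8, scale=0.25):
    # Adam runs on the projected (rank x n) gradient, so the m and v buffers
    # are rank-sized rather than full-parameter-sized.
    R = P.T @ G                                          # project gradient down
    m.mul_(betas[0]).add_(R, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(R, R, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** step)
    v_hat = v / (1 - betas[1] ** step)
    W.add_(P @ (m_hat / (v_hat.sqrt() + eps)), alpha=-lr * scale)  # project update back up
```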

Vanilla LLaMA 7B pretraining under AdamW requires at least 58 GB: 14 GB for parameters, 42 GB for optimizer states and gradients, and 2 GB for activations. Optimizer selection is now a hardware provisioning decision, not a hyperparameter detail. Teams running multi-run continual pretraining pipelines face a further constraint: Muon-pretrained models fine-tuned with AdamW, and vice versa, underperform models that keep a single optimizer throughout. Optimizer continuity across training stages has to be planned from day one.

FIG. 03 Memory requirements for LLaMA 7B pretraining under AdamW: optimizer state and gradients consume 72% of total memory. — LLaMA pretraining analysis
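The 58 GB total and the 72% share shown in FIG. 03 are consistent with holding parameters, gradients, and both Adam moment buffers in 2-byte precision. The arithmetic below is our own check under that assumption, with the 2 GB activation figure taken as quoted rather than recomputed.

```python
params_billions = 7
bytes_per_value = 2                               # assumes BF16 storage throughout

parameters   = params_billions * bytes_per_value  # 14 GB
gradients    = params_billions * bytes_per_value  # 14 GB
adam_moment1 = params_billions * bytes_per_value  # 14 GB
adam_moment2 = params_billions * bytes_per_value  # 14 GB
activations  = 2                                  # GB, as quoted in the article

optimizer_and_grads = gradients + adam_moment1 + adam_moment2   # 42 GB
total = parameters + optimizer_and_grads + activations          # 58 GB
print(total, round(optimizer_and_grads / total, 2))             # 58, 0.72
```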

Second-order methods such as full Gauss-Newton reach equivalent loss in roughly 1/16 as many steps as Muon, but their per-step compute cost remains impractical at scale. The survey positions better curvature approximations as the most tractable frontier. Low-rank projection methods like GaLore and SOAP are converging on the same insight from a memory-first angle, which suggests deeper integration between the geometry-aware and memory-efficient families ahead.
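The tradeoff reduces to a break-even condition: a 1/16 step count only becomes a wall-clock win if per-step cost grows by less than 16×. In the check below, the per-step cost ratio is a hypothetical placeholder, not a figure from the survey.

```python
step_ratio = 1 / 16            # Gauss-Newton steps relative to Muon, per the survey
per_step_cost_ratio = 20.0     # hypothetical: Gauss-Newton per-step compute vs Muon
wall_clock_ratio = step_ratio * per_step_cost_ratio
print(wall_clock_ratio)        # 1.25 > 1.0: slower end to end despite 16x fewer steps
```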

For infrastructure teams validating optimizer choices in 2026: benchmark against your target model size, token budget, and hardware topology. A 2× token efficiency gain on a 1B-parameter model may not replicate at 30B. Measure all six axes before lock-in.
