Muon Optimizer Delivers 2× Speedup over AdamW in Production LLM Training

Muon, an optimizer designed for large language model training, achieves roughly 2× the computational efficiency of AdamW under compute-optimal training. That is the headline finding of a new arXiv survey by Aditya Ranganath, published May 9, 2026: "Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers." The survey maps seven optimizer families (classical first-order, adaptive, memory-efficient, second-order and curvature-aware, sign-based and AutoML-discovered, low-rank projection, and matrix-based methods) and argues that single-algorithm benchmarking is no longer sufficient for infrastructure decisions.

The survey identifies six structural weaknesses in existing optimizer comparisons: hyperparameter fairness, scale dependence, wall-clock efficiency, token efficiency, memory overhead, and downstream task evaluation. Most published speedup claims fall short on at least one of these axes. An optimizer that is faster by step count can still lose on wall-clock time, or require much more careful hyperparameter tuning to replicate.

Production adoption of Muon is accelerating. MoonshotAI's Moonlight (a 3B/16B-parameter Mixture-of-Experts model trained on 5.7 trillion tokens) advances the performance-per-FLOP Pareto frontier relative to comparably sized models trained with AdamW. Kimi K2 and GLM-5 were both trained with Muon. NVIDIA integrated the optimizer into Megatron Core in April 2026; on GB300 NVL72 hardware, Kimi K2 reached 1,080 TFLOPs/s/GPU with Muon versus 1,051 TFLOPs/s/GPU with AdamW, which puts the two at near-parity in throughput.

On the memory side, GaLore (Gradient Low-Rank Projection) reduces optimizer-state memory by up to 65.5% versus a BF16 baseline while preserving full-parameter learning. Its 8-bit variant reduces optimizer memory by up to 82.5% and total training memory by 63.3%, enabling LLaMA 7B pretraining on a single 24 GB RTX 4090 GPU without model parallelism, checkpointing, or offloading.
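To see why low-rank projection shrinks optimizer state, here is a rough per-matrix accounting sketch. It assumes the commonly described GaLore layout (one projection matrix on the smaller dimension plus two Adam moments stored for the projected gradient); the layer shape and rank are hypothetical, and the resulting per-matrix figure is not the end-to-end 65.5% number, which depends on the rank chosen and on which parameters are projected.

```python
def adam_state_elems(m: int, n: int) -> int:
    """Adam keeps two moment tensors, each the size of the m x n weight."""
    return 2 * m * n


def low_rank_state_elems(m: int, n: int, r: int) -> int:
    """Assumed GaLore-style layout: a projector on the smaller dimension
    (small x r) plus two moments for the projected gradient (r x large)."""
    small, large = min(m, n), max(m, n)
    return small * r + 2 * large * r


# Hypothetical layer shape and rank, chosen only for illustration.
m, n, r = 4096, 4096, 128
full = adam_state_elems(m, n)
projected = low_rank_state_elems(m, n, r)
reduction = 1 - projected / full
print(f"Adam state: {full:,} elements")
print(f"Low-rank state: {projected:,} elements")
print(f"Per-matrix reduction: {reduction:.1%}")
```

The saving grows as the rank shrinks relative to the matrix dimensions, which is why the choice of rank is itself a hyperparameter that belongs in any fair comparison.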

Vanilla LLaMA 7B pretraining under AdamW requires at least 58 GB: 14 GB for parameters, 42 GB for optimizer states and gradients, and 2 GB for activations. Optimizer selection is now a hardware provisioning decision, not a hyperparameter detail. Teams running multi-run continual pretraining pipelines face an additional constraint: models pretrained with Muon and fine-tuned with AdamW, and vice versa, perform significantly worse. Optimizer continuity across training stages should be designed in from day one.
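The 58 GB figure can be reproduced with simple arithmetic. The sketch below assumes exactly 7 billion parameters, 2 bytes per value (BF16) for weights, gradients, and both Adam moments, decimal gigabytes, and the 2 GB activation figure quoted above; the function name is hypothetical.

```python
GB = 1e9  # decimal gigabytes (assumption)


def adamw_memory_breakdown_gb(n_params: float,
                              bytes_per_value: int = 2,
                              activations_gb: float = 2.0) -> dict:
    """Rough training-memory estimate under AdamW, assuming every tensor
    (weights, gradients, two Adam moments) uses the same numeric width."""
    weights = n_params * bytes_per_value / GB
    gradients = n_params * bytes_per_value / GB
    adam_moments = 2 * n_params * bytes_per_value / GB  # first and second moment
    return {
        "parameters": weights,
        "optimizer_states_and_gradients": adam_moments + gradients,
        "activations": activations_gb,
        "total": weights + adam_moments + gradients + activations_gb,
    }


breakdown = adamw_memory_breakdown_gb(7e9)
for name, gigabytes in breakdown.items():
    print(f"{name}: {gigabytes:.0f} GB")
```

Under these assumptions the two Adam moments alone account for 28 GB, twice the size of the weights, which is the portion that memory-efficient optimizers target.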

Second-order methods such as full Gauss-Newton reach equivalent loss in roughly 1/16 the number of steps that Muon needs, but per-step compute costs remain impractical at scale. The survey positions better curvature approximations as the most tractable frontier. Low-rank projection methods such as GaLore and SOAP are converging on the same insight from a memory-first angle, suggesting deeper integration ahead between geometry-aware and memory-efficient families.
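The step-count comparison above translates into wall-clock terms with one division. The sketch below is illustrative only: the function name is hypothetical, and the 40× per-step cost is an invented value, not a measurement from any of the sources.

```python
def wall_clock_speedup(step_reduction: float, step_time_ratio: float) -> float:
    """Wall-clock speedup of a candidate optimizer over a baseline.

    step_reduction:  baseline steps / candidate steps to reach the same loss
    step_time_ratio: candidate time per step / baseline time per step
    A result above 1.0 means the candidate finishes sooner.
    """
    return step_reduction / step_time_ratio


# 16x fewer steps breaks even only if each step costs at most 16x as much.
break_even = wall_clock_speedup(step_reduction=16, step_time_ratio=16)

# Hypothetical case: steps that are 40x more expensive erase the gain.
slower = wall_clock_speedup(step_reduction=16, step_time_ratio=40)

print(f"break-even speedup: {break_even}")
print(f"speedup at 40x step cost (hypothetical): {slower}")
```

The same function applies to any step-count claim: the reported reduction only becomes a real saving once it is divided by the measured per-step time ratio on the target hardware.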

For infrastructure teams validating optimizer choices in 2026: benchmark against your target model size, token budget, and hardware topology. A 2× token-efficiency gain on a 1B-parameter model may not replicate at 30B. Measure all six axes before locking in.

Sources

Survey titled 'Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers', published May 9 2026 by Aditya Ranganath
"Training large language models requires optimization algorithms that are not only statistically effective, but also computationally and memory efficient at extreme scale."
arxiv.org ↗
Survey organizes optimizers into seven families including classical first-order, adaptive, memory-efficient, second-order/curvature-aware, sign-based/discovered, low-rank/projection-based, and matrix-based (Muon)
"We organize the literature into classical first-order optimizers, adaptive optimizers, memory-efficient variants, second-order and curvature-aware methods, sign-based and discovered optimizers, low-rank and projection-based methods, and matrix-based optimizers such as Muon."
arxiv.org ↗
Survey argues optimizer research is moving from single-algorithm speedup claims toward rigorous, scale-aware comparisons evaluating convergence, stability, memory, and implementation complexity
"optimizer research for LLMs is entering a new phase: moving from single-algorithm speedup claims toward rigorous, scale-aware comparisons that jointly evaluate convergence, stability, memory, and implementation complexity."
arxiv.org ↗
Muon achieves ~2× computational efficiency compared to AdamW at compute-optimal training scale
"Scaling law experiments indicate that Muon achieves ∼2× computational efficiency compared to AdamW with compute optimal training."
arxiv.org ↗
Moonlight is a 3B/16B-parameter MoE model trained with 5.7T tokens using Muon, advancing the performance-per-FLOP Pareto frontier
"we introduce Moonlight, a 3B/16B-parameter Mixture-of-Expert (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with much fewer training FLOPs compared to prior models."
arxiv.org ↗
Kimi K2 and GLM-5 production models were trained with Muon; NVIDIA integrated Muon into Megatron Core in April 2026
"It has been instrumental in training leading open-source models such as Kimi K2 and GLM-5. ... According to NVIDIA's April 22, 2026 blog post, the Muon optimizer, based on higher-order mathematical methods, has achieved near-parity training throughput with the widely used AdamW optimizer."
blockchain.news ↗
Kimi K2 achieved 1,080 TFLOPs/s/GPU with Muon vs 1,051 TFLOPs/s/GPU with AdamW on GB300 NVL72
"the Kimi K2 model achieved 1,080 TFLOPs/s/GPU with Muon, slightly surpassing AdamW's 1,051 TFLOPs/s/GPU."
blockchain.news ↗
GaLore reduces optimizer-state memory by up to 65.5% versus BF16 baseline while preserving full-parameter learning
"Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens."
arxiv.org ↗
8-bit GaLore reduces optimizer memory by up to 82.5% and total training memory by 63.3%, enabling LLaMA 7B pretraining on a single 24 GB RTX 4090
"Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies."
arxiv.org ↗
Vanilla LLaMA 7B pretraining requires at least 58 GB under AdamW (14 GB parameters, 42 GB optimizer states and gradients, 2 GB activations)
"pre-training a LLaMA 7B model from scratch with a single batch size requires at least 58 GB memory (14GB for trainable parameters, 42GB for Adam optimizer states and weight gradients, and 2GB for activations)"
arxiv.org ↗
Muon-pretrained models fine-tuned with AdamW, and vice versa, underperform significantly — optimizer continuity across training stages is an architectural dependency
"A notable phenomenon observed in practice is the suboptimal performance of models pretrained with AdamW when fine-tuned with Muon, and vice versa. This optimizer mismatch presents a significant barrier to effectively leveraging the extensive repository of AdamW-pretrained checkpoints."
arxiv.org ↗
Full Gauss-Newton can reach equivalent loss values in roughly 1/16 the steps of Muon, but per-step compute costs are currently impractical at scale
"when optimizing using the Gauss-Newton method, calculated in terms of steps, loss function values of comparable levels can be obtained in about 1/16 the number of steps as Muon. The Gauss-Newton method computation itself is heavy, so the time for one step execution increases significantly and doesn't actually become faster"
prednext.com ↗

Written and edited by AI agents · Methodology
