Researchers at the University of Illinois Urbana-Champaign and Apple have found that matching the optimizer during finetuning to the one used in pretraining sharply reduces catastrophic forgetting. Across tested configurations, this approach, termed optimizer-model consistency, dominates both cross-optimizer finetuning and LoRA on the learning-forgetting Pareto frontier.
The core experiment finetunes Llama-2-7B on MetaMathQA for 11 epochs. Three setups were tested: full finetuning with AdamW (Llama-2's pretraining optimizer), full finetuning with Muon, and LoRA. Measured on knowledge retention and task performance, AdamW-matched finetuning dominates the Pareto frontier: it forgets less while achieving equal or better task performance than every alternative.
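A minimal sketch of the three configurations, assuming a Hugging Face causal LM and the peft library for LoRA; the hyperparameters are illustrative rather than the paper's exact recipe, and the Muon import is a placeholder since PyTorch ships no built-in Muon:

```python
# Illustrative setup of the three configurations compared in the paper.
# Hyperparameters are assumptions, and the Muon import is hypothetical.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# 1) Optimizer-matched full finetuning: Llama-2 was pretrained with AdamW,
#    so finetuning reuses AdamW over all parameters.
adamw = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)

# 2) Cross-optimizer full finetuning: swap in any reference Muon
#    implementation with a comparable constructor.
# from muon import Muon                      # hypothetical import
# muon = Muon(model.parameters(), lr=2e-2)

# 3) LoRA: low-rank adapters on the attention projections; base weights
#    stay frozen, so far fewer parameters are updated.
lora_model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)
lora_adamw = torch.optim.AdamW(
    (p for p in lora_model.parameters() if p.requires_grad), lr=1e-4
)
```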
Different optimizers leave distinct structural fingerprints on pretrained models through their regularization effects on activations, and those fingerprints shape the loss landscape around the pretrained checkpoint. To minimize interference with pretraining knowledge, weight updates during supervised finetuning must follow structures aligned with that landscape. Matching the optimizer produces aligned updates; switching optimizers does not. The paper's theoretical analysis supports this account.
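One concrete way to see those fingerprints is in the shape of a single update step. The sketch below contrasts AdamW's elementwise scaling with Muon's matrix-level orthogonalization, using the Newton-Schulz coefficients from Keller Jordan's reference Muon implementation; the random stand-in gradient and single-step view are illustrative assumptions, not the paper's analysis:

```python
# Contrasting the structure each optimizer imposes on one update step.
import torch

def newton_schulz_orth(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G onto the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315  # reference Muon coefficients
    X = G / (G.norm() + 1e-7)
    if transposed := X.shape[0] > X.shape[1]:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

G = torch.randn(512, 512)  # stand-in gradient for one weight matrix

# Adam/AdamW's first step with zero state reduces to an elementwise sign,
# which ignores the matrix structure of the update entirely.
adamw_update = torch.sign(G)

# Muon orthogonalizes the whole matrix, pushing all singular values toward
# 1: a qualitatively different fingerprint on the weights.
muon_update = newton_schulz_orth(G)
print(torch.linalg.svdvals(muon_update)[:5])  # roughly 1 after a few steps
```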
For practitioners, the implication is direct. Model providers often disclose their pretraining optimizer: Llama-2 used AdamW, while Kimi and DeepSeek have disclosed Muon and other matrix-structured optimizers. When provenance is known, optimizer-matched full finetuning is the research-backed default. LoRA, typically deployed for parameter efficiency and reduced memory, provides no equivalent protection against forgetting: the paper shows it is dominated on the Pareto frontier despite updating far fewer parameters.
Muon-pretrained checkpoints produce stronger base models than AdamW-pretrained equivalents, yet models finetuned with Muon underperform on reasoning tasks. Muon biases toward rote memorization: it excels at pattern extraction over a large pretraining corpus, but that same bias becomes a liability in finetuning, where data volume is small and the goal is generalization. A synthetic language modeling experiment isolates this effect.
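The paper's synthetic experiment is not reproduced here, but a generic probe in the same spirit reads the gap between training and held-out loss on a small dataset as a memorization signal; the teacher setup, model, and hyperparameters below are assumptions for illustration only:

```python
# Toy memorization-vs-generalization probe (not the paper's setup): the
# train/held-out loss gap on a small noisy dataset is the memorization signal.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
teacher = torch.randn(d, 1)  # shared linear teacher for train and test

def make_data(n: int):
    X = torch.randn(n, d)
    y = X @ teacher + 0.5 * torch.randn(n, 1)  # noise invites memorization
    return X, y

X_tr, y_tr = make_data(64)    # small training set, as in finetuning
X_te, y_te = make_data(256)

def generalization_gap(make_opt, steps: int = 500) -> float:
    model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
    opt, loss_fn = make_opt(model.parameters()), nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        return (loss_fn(model(X_te), y_te) - loss_fn(model(X_tr), y_tr)).item()

# A larger gap means more rote memorization of the small training set.
print("AdamW gap:", generalization_gap(lambda p: torch.optim.AdamW(p, lr=1e-3)))
# print("Muon gap:", generalization_gap(lambda p: Muon(p)))  # hypothetical import
```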
The effect is demonstrated on Llama-2-7B with math domain data. Whether it holds across model scales, optimizer families beyond AdamW and Muon, and domains beyond math—legal, biomedical, code—remains untested. Teams whose base models were pretrained with Adagrad variants or distributed second-order methods fall outside the paper's validated scope.
For finetuning strategy selection, the rule is simple. If the base model's pretraining optimizer is documented and compute allows full finetuning, match it. If the pretraining optimizer is unknown, ask the model provider for that metadata. LoRA offers no theoretically grounded protection against forgetting; optimizer alignment does.
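As a summary, a hypothetical helper encoding that decision rule; the function name and return strings are illustrative, not from the paper:

```python
# Hypothetical encoding of the strategy-selection rule described above.
from typing import Optional

def choose_finetuning_strategy(pretraining_optimizer: Optional[str],
                               full_ft_affordable: bool) -> str:
    if pretraining_optimizer is None:
        return "ask the model provider for pretraining-optimizer metadata"
    if full_ft_affordable:
        return f"full finetuning with {pretraining_optimizer} (optimizer-matched)"
    return "parameter-efficient finetuning, accepting weaker forgetting protection"

print(choose_finetuning_strategy("AdamW", full_ft_affordable=True))
```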