Researchers at the University of Illinois Urbana-Champaign and Apple have found that matching the optimizer during finetuning to the one used in pretraining sharply reduces catastrophic forgetting. Across tested configurations, this approach, termed optimizer-model consistency, dominates both cross-optimizer finetuning and LoRA on the learning-forgetting Pareto frontier.
The core experiment finetunes Llama-2-7B on MetaMathQA for 11 epochs. Three setups were tested: full finetuning with AdamW (Llama-2's pretraining optimizer), full finetuning with Muon, and LoRA. Measured on knowledge retention and task performance, AdamW-matched finetuning dominates the Pareto frontier: it forgets less while achieving equal or better task performance than every alternative.
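A minimal sketch of the three configurations, assuming a Hugging Face causal LM and the peft library for LoRA; the hyperparameters are illustrative rather than the paper's exact recipe, and the Muon import is a placeholder since PyTorch ships no built-in Muon:

```python
# Illustrative setup of the three configurations compared in the paper.
# Hyperparameters are assumptions, and the Muon import is hypothetical.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# 1) Optimizer-matched full finetuning: Llama-2 was pretrained with AdamW,
#    so finetuning reuses AdamW over all parameters.
adamw = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)

# 2) Cross-optimizer full finetuning: swap in any reference Muon
#    implementation with a comparable constructor.
# from muon import Muon                      # hypothetical import
# muon = Muon(model.parameters(), lr=2e-2)

# 3) LoRA: low-rank adapters on the attention projections; base weights
#    stay frozen, so far fewer parameters are updated.
lora_model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)
lora_adamw = torch.optim.AdamW(
    (p for p in lora_model.parameters() if p.requires_grad), lr=1e-4
)
```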
Different optimizers leave distinct structural fingerprints on pretrained models through their regularization effects on activations, and those fingerprints shape the loss landscape around the pretrained checkpoint. To minimize interference with pretraining knowledge, weight updates during supervised finetuning must follow structures aligned with that landscape. Matching the optimizer produces aligned updates; switching optimizers does not. The paper's theoretical analysis supports this account.
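One concrete way to see those fingerprints is in the shape of a single update step. The sketch below contrasts AdamW's elementwise scaling with Muon's matrix-level orthogonalization, using the Newton-Schulz coefficients from Keller Jordan's reference Muon implementation; the random stand-in gradient and single-step view are illustrative assumptions, not the paper's analysis:

```python
# Contrasting the structure each optimizer imposes on one update step.
import torch

def newton_schulz_orth(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G onto the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315  # reference Muon coefficients
    X = G / (G.norm() + 1e-7)
    if transposed := X.shape[0] > X.shape[1]:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

G = torch.randn(512, 512)  # stand-in gradient for one weight matrix

# Adam/AdamW's first step with zero state reduces to an elementwise sign,
# which ignores the matrix structure of the update entirely.
adamw_update = torch.sign(G)

# Muon orthogonalizes the whole matrix, pushing all singular values toward
# 1: a qualitatively different fingerprint on the weights.
muon_update = newton_schulz_orth(G)
print(torch.linalg.svdvals(muon_update)[:5])  # roughly 1 after a few steps
```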
For practitioners, the implication is direct. Model providers often disclose their pretraining optimizer: Llama-2 used AdamW, while Kimi and DeepSeek have disclosed Muon and other matrix-structured optimizers. When provenance is known, optimizer-matched full finetuning is the research-backed default. LoRA, typically deployed for parameter efficiency and reduced memory, provides no equivalent protection against forgetting: the paper shows it is dominated on the Pareto frontier despite updating far fewer parameters.
Muon-pretrained checkpoints produce stronger base models than AdamW-pretrained equivalents, yet models finetuned with Muon underperform on reasoning tasks. Muon biases toward rote memorization: it excels at pattern extraction over a large pretraining corpus, but that same bias becomes a liability in finetuning, where data volume is small and the goal is generalization. A synthetic language modeling experiment isolates this effect.
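The paper's synthetic experiment is not reproduced here, but a generic probe in the same spirit reads the gap between training and held-out loss on a small dataset as a memorization signal; the teacher setup, model, and hyperparameters below are assumptions for illustration only:

```python
# Toy memorization-vs-generalization probe (not the paper's setup): the
# train/held-out loss gap on a small noisy dataset is the memorization signal.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
teacher = torch.randn(d, 1)  # shared linear teacher for train and test

def make_data(n: int):
    X = torch.randn(n, d)
    y = X @ teacher + 0.5 * torch.randn(n, 1)  # noise invites memorization
    return X, y

X_tr, y_tr = make_data(64)    # small training set, as in finetuning
X_te, y_te = make_data(256)

def generalization_gap(make_opt, steps: int = 500) -> float:
    model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
    opt, loss_fn = make_opt(model.parameters()), nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        return (loss_fn(model(X_te), y_te) - loss_fn(model(X_tr), y_tr)).item()

# A larger gap means more rote memorization of the small training set.
print("AdamW gap:", generalization_gap(lambda p: torch.optim.AdamW(p, lr=1e-3)))
# print("Muon gap:", generalization_gap(lambda p: Muon(p)))  # hypothetical import
```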
The effect is demonstrated on Llama-2-7B with math domain data. Whether it holds across model scales, optimizer families beyond AdamW and Muon, and domains beyond math—legal, biomedical, code—remains untested. Teams whose base models were pretrained with Adagrad variants or distributed second-order methods fall outside the paper's validated scope.
For finetuning strategy selection, the rule is simple. If the base model's pretraining optimizer is documented and compute allows full finetuning, match it. If the pretraining optimizer is unknown, ask the model provider for that metadata. LoRA offers no theoretically grounded protection against forgetting; optimizer alignment does.
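As a summary, a hypothetical helper encoding that decision rule; the function name and return strings are illustrative, not from the paper:

```python
# Hypothetical encoding of the strategy-selection rule described above.
from typing import Optional

def choose_finetuning_strategy(pretraining_optimizer: Optional[str],
                               full_ft_affordable: bool) -> str:
    if pretraining_optimizer is None:
        return "ask the model provider for pretraining-optimizer metadata"
    if full_ft_affordable:
        return f"full finetuning with {pretraining_optimizer} (optimizer-matched)"
    return "parameter-efficient finetuning, accepting weaker forgetting protection"

print(choose_finetuning_strategy("AdamW", full_ft_affordable=True))
```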