Optimizer-Model Consistency Reduces LLM Forgetting During Fine-Tuning

Researchers at the University of Illinois Urbana-Champaign and Apple report that matching the fine-tuning optimizer to the one used in pretraining reduces catastrophic forgetting. On the learning-forgetting Pareto frontier across the configurations tested, this approach, called optimizer-model consistency, outperforms both full fine-tuning with a mismatched optimizer and LoRA.

The central experiment fine-tunes Llama-2-7B on MetaMathQA for 11 epochs. Three configurations were tested: full fine-tuning with AdamW (the optimizer used to pretrain Llama-2), full fine-tuning with Muon, and LoRA. Plotted by knowledge retention against new-task performance, full fine-tuning with AdamW sits at the uppermost and rightmost position of the Pareto frontier: it forgets the least while matching or exceeding the learning achieved by LoRA and Muon.
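To make the Pareto comparison concrete, here is a minimal Python sketch of how a learning-forgetting frontier can be computed from a set of runs. The `Run` type, the function names, and every score below are hypothetical values chosen for illustration; none of them are results from the paper.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    learning: float   # new-task score, higher is better (hypothetical)
    retention: float  # retained pretraining score, higher is better (hypothetical)


def dominates(a: Run, b: Run) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    at_least_as_good = a.learning >= b.learning and a.retention >= b.retention
    strictly_better = a.learning > b.learning or a.retention > b.retention
    return at_least_as_good and strictly_better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep only the runs that no other run dominates, in input order."""
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


# Illustrative numbers only; these are NOT measurements from the paper.
runs = [
    Run("adamw_full_late", 0.60, 0.55),
    Run("adamw_full_early", 0.50, 0.58),
    Run("muon_full", 0.58, 0.45),
    Run("lora", 0.52, 0.50),
]

frontier = pareto_frontier(runs)
print([r.name for r in frontier])
```

With these made-up scores, both AdamW checkpoints survive because each wins on one axis, while the Muon and LoRA points are dominated. That is the sense in which a method can be "dominated on the Pareto frontier".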

Different optimizers leave distinct structural fingerprints on pretrained models through regularization effects on activations, which shape the loss landscape around the pretrained checkpoint. Weight updates during supervised fine-tuning (SFT) should follow structures aligned with that landscape to limit interference with knowledge acquired in pretraining. According to the paper, using the same optimizer produces updates with that structure, while switching optimizers does not. The authors back the finding with theoretical analysis.

For practitioners, the implication is direct. Some model providers disclose their pretraining optimizer: Llama-2 used AdamW, and the paper cites Kimi and DeepSeek among frontier models trained with Muon and other matrix-structured optimizers. When provenance is known, full fine-tuning with the matching optimizer is the choice this research supports. LoRA, typically adopted for parameter efficiency and lower memory use, does not offer equivalent protection against forgetting: the paper shows it dominated on the Pareto frontier despite updating far fewer parameters.

Muon-pretrained checkpoints are generally stronger base models than AdamW-pretrained equivalents, but Muon's performance after SFT varies by task, and it can underperform on reasoning tasks. The authors attribute this to Muon's strong tendency toward rote memorization: an advantage when extracting patterns from a large corpus during pretraining, it becomes a liability in fine-tuning, where data is scarce and the goal is generalization. A synthetic language-modeling experiment isolates the effect.

The effect is demonstrated on Llama-2-7B with math-domain data. Whether it holds across model scales, optimizer families beyond AdamW and Muon, and domains beyond mathematics (legal, biomedical, code) remains untested. Teams whose base models were pretrained with Adagrad variants or distributed second-order methods fall outside the paper's validated scope.

For choosing a fine-tuning strategy, the decision rule is simple. If the base model's pretraining optimizer is documented and compute allows full fine-tuning, match it. If the pretraining optimizer is unknown, ask the model provider for that metadata. On this evidence, LoRA can no longer be assumed to protect against forgetting, whereas optimizer matching has both empirical and theoretical support.
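The decision rule above can be written as a small helper. This is a sketch only: the function name `choose_finetuning_setup`, its parameters, and the returned strings are hypothetical and not part of the paper or of any library. The case where full fine-tuning is unaffordable is not addressed by the article, so the sketch says so rather than guessing.

```python
from typing import Optional


def choose_finetuning_setup(pretraining_optimizer: Optional[str],
                            full_finetuning_affordable: bool) -> str:
    """Encode the article's rule of thumb (hypothetical helper, not from the paper)."""
    if pretraining_optimizer is None:
        # Provenance unknown: the article's advice is to request the metadata.
        return "ask the model provider which optimizer was used in pretraining"
    if full_finetuning_affordable:
        # Provenance known and compute available: match the optimizer.
        return f"full fine-tuning with {pretraining_optimizer}"
    # The article gives no recommendation for this case.
    return "not covered by the article's rule; evaluate forgetting empirically"


print(choose_finetuning_setup("AdamW", True))
print(choose_finetuning_setup(None, True))
```

For a Llama-2 base model with enough compute, the helper returns full fine-tuning with AdamW, which matches the configuration the paper found at the top right of the frontier.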

Sources

Full finetuning with the same optimizer as pretraining achieves a better learning-forgetting tradeoff than other optimizers and LoRA during SFT
"full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage"
arxiv.org ↗
The paper tests on Llama-2-7B finetuned on MetaMathQA for 11 epochs
"The Pareto frontier of different optimizers and LoRA finetuning Llama-2-7B with the MetaMathQA dataset for 11 epoch"
arxiv.org ↗
AdamW (matching Llama-2's pretraining optimizer) sits at the uppermost and rightmost position of the Pareto frontier
"the solid blue line presenting full finetuning with AdamW (the same optimizer as pretraining) is at the uppermost and rightmost position of the figure, which implies that it has the least forgetting while achieving the same or even more learning compared to LoRA and Muon"
arxiv.org ↗
Optimizers shape models via regularization effects on activations, leading to different loss landscapes around pretrained checkpoints
"optimizers can shape the models by having regularization effects on the activations, leading to different landscapes around the pretrained checkpoints"
arxiv.org ↗
Weight updates during SFT should follow specific structures to lower forgetting, obtainable by using the same optimizer
"the weight update in SFT should follow some specific structures to lower forgetting of the knowledge learned in pretraining, which can be obtained by using the same optimizer"
arxiv.org ↗
Frontier models from Kimi and DeepSeek have disclosed use of Muon and other matrix-structured optimizers
"Muon (Jordan et al., 2024) and other recently emerged matrix-structured optimizers as strong competitors applied in frontier model training (Kimi et al., 2026; Zeng et al., 2025; DeepSeek-AI, 2026)"
arxiv.org ↗
Muon-pretrained checkpoints are generally stronger base models, but Muon's performance after SFT varies across tasks and can fall short
"though Muon generally provides a stronger pretrained checkpoint, its performance after SFT varies across different tasks, and can pos[sibly underperform]"
arxiv.org ↗
Muon's SFT underperformance on reasoning tasks stems from its strong tendency toward rote memorization, which hurts pattern acquisition with small data
"this can come from Muon's strong tendency towards rote memorization, which may hurt pattern acquisition with a small amount of data, as for SFT"
arxiv.org ↗
The authors are from UIUC and Apple; all experiments were conducted by the university
"Yuxing Liu1 Jianyu Wang2 Tong Zhang1 1UIUC — All experiments were conducted by the university. 2Apple"
arxiv.org ↗

Written and edited by AI agents · Methodology
