Muon fine-tuning on Adam-pretrained weights drops performance; LoRA narrows the gap
New research shows that switching from Adam to Muon when fine-tuning Adam-pretrained models causes a performance drop due to optimizer mismatch, and that LoRA narrows the gap across language and vision tasks by constraining update strength.

A preprint from Xingyu Qu, Peigeng Huang, and Samuel Horvath examines why Muon — an efficient alternative to Adam for pretraining — struggles when applied to fine-tuning models that were pretrained with Adam. The paper, posted to arXiv on May 12, demonstrates through controlled experiments that the optimizer mismatch disrupts pretrained knowledge, with the severity of disruption scaling directly with update strength. Most open models today ship with Adam-pretrained weights, making the switch to Muon for downstream tasks a practical headache for practitioners who want the efficiency gains Muon promises.
The core problem is implicit bias: each optimizer steers weights toward its own characteristic structure. Adam rescales every coordinate by a running second-moment estimate, while Muon orthogonalizes the momentum matrix before applying it, so the two push weights in different directions during training. Loading Adam-pretrained checkpoints and fine-tuning with Muon therefore sets the optimizer's update rule against the structure the model learned during pretraining. Under full fine-tuning, where all model parameters are updated, continuing with Adam significantly outperformed switching to Muon, and the degradation scaled with how aggressively the optimizer modified the weights.
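To make the structural contrast concrete, here is a minimal PyTorch sketch of Muon's update rule. It follows the widely circulated Newton-Schulz formulation rather than anything from the paper; the function names are ours, the quintic coefficients and shape-aware scaling are conventions from Keller Jordan's reference implementation, and production versions add details such as weight decay and handling for non-2D parameters.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately replace G's singular values with 1 while keeping its
    # singular vectors, i.e. push the momentum matrix toward an orthogonal
    # one. Quintic coefficients from the reference Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)            # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W: torch.Tensor, G: torch.Tensor, M: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    # One Muon step on a 2D weight matrix W with gradient G and momentum
    # buffer M. Unlike Adam's elementwise rescaling, the applied update is
    # (approximately) orthogonal — the implicit bias the paper identifies
    # as clashing with Adam-pretrained weights.
    M.mul_(momentum).add_(G)
    O = newton_schulz_orthogonalize(M)
    scale = max(1.0, W.size(0) / W.size(1)) ** 0.5  # shape-aware step size
    W.add_(O, alpha=-lr * scale)
```

The key point: every Muon update has a roughly flat singular-value spectrum, while Adam's updates inherit per-coordinate scaling, so sustained Muon steps pull Adam-shaped weights toward a different spectral structure.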
LoRA emerged as a practical fix. By constraining updates to low-rank adapter matrices instead of touching the full parameter set, LoRA reduces the strength of each update step. Across language and vision benchmarks, it narrowed the Adam-versus-Muon gap that appeared under full fine-tuning. The authors tested LoRA rank settings, catastrophic forgetting scenarios, and alternative LoRA variants, and every angle confirmed the same relationship: mismatch severity correlates with update strength. Lower rank means smaller updates, which means less disruption when switching optimizers.

For practitioners working with Adam-pretrained weights who want to use Muon downstream, the paper's recommendation is to reach for LoRA or another parameter-efficient method that limits update magnitude. Code is available at https://github.com/XingyuQu/muon-finetune.
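For the mechanism behind that recommendation, the sketch below shows a generic LoRA adapter in PyTorch. It is a minimal illustration under standard LoRA conventions, not the authors' code: the class name and default hyperparameters are ours, and the `r` and `alpha` names follow the original LoRA paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update.

    The effective weight is W + (alpha / r) * B @ A, so the optimizer can
    only perturb the pretrained weights within a rank-r subspace; lower r
    means a weaker perturbation. Generic sketch, not the paper's code.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)       # pretrained weights stay fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        # B starts at zero, so the adapter is initially a no-op and the
        # model begins fine-tuning exactly at the pretrained solution.
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because only `A` and `B` receive gradients, any optimizer, Muon included, can move the model only within a rank-r subspace of each weight matrix; shrinking `r` shrinks the maximum update strength, which is exactly the variable the paper ties to mismatch severity.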