HELLoRA cuts MoE fine-tuning parameters by 84% while boosting accuracy
New adapter method targets frequently activated experts in Mixture-of-Experts models, achieving a 9.2% accuracy improvement on OLMoE while using only 15.7% of standard LoRA's trainable parameters.
HELLoRA is a parameter-efficient fine-tuning technique that attaches Low-Rank Adaptation (LoRA) modules only to the most frequently activated experts in each layer of a Mixture-of-Experts (MoE) language model. It exploits the sparse activation inherent to MoE architectures, where the router sends each token to only a small subset of expert networks, by concentrating adapter capacity on the experts that process the most tokens. LoRA has become the dominant way to fine-tune large language models without retraining all of their parameters, but most LoRA variants target dense architectures, leaving MoE models comparatively underexplored.
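To make the "most frequently activated experts" idea concrete, here is a minimal sketch of how hot experts could be picked per layer by counting router decisions. It assumes routing decisions have been logged as per-token tuples of top-k expert indices (a hypothetical format), and `num_hot` is a hypothetical hyperparameter; the article does not specify how HELLoRA measures frequency or how many experts it selects.

```python
from collections import Counter


def select_hot_experts(routing_log, num_hot):
    """Pick the `num_hot` most frequently activated experts in each MoE layer.

    routing_log maps a layer index to a list of per-token routing decisions,
    each a tuple of the expert indices the router selected (its top-k).
    This input format is a hypothetical choice for illustration.
    """
    hot = {}
    for layer, token_choices in routing_log.items():
        # Count how many tokens were routed to each expert in this layer.
        counts = Counter(e for choice in token_choices for e in choice)
        # Rank by descending count; break ties by expert index for determinism.
        ranked = sorted(counts, key=lambda e: (-counts[e], e))
        hot[layer] = sorted(ranked[:num_hot])
    return hot


# Toy routing log: 4 tokens, top-2 routing, two layers.
routing_log = {
    0: [(0, 3), (3, 5), (3, 0), (1, 3)],
    1: [(2, 4), (4, 6), (2, 7), (4, 2)],
}
print(select_hot_experts(routing_log, num_hot=2))
# → {0: [0, 3], 1: [2, 4]}
```

In this sketch, LoRA adapters would then be attached only to the linear layers of the selected experts in each layer, while every other expert stays frozen and adapter-free.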
Tested on OLMoE-1B-7B, Mixtral-8x7B, and DeepSeekMoE across mathematical reasoning, code generation, and safety alignment tasks, HELLoRA consistently outperformed standard LoRA and other parameter-efficient baselines. On OLMoE, HELLoRA used 15.7% of vanilla LoRA's trainable parameters, cut adapter FLOPs by 38.7%, delivered 1.9× training throughput, and improved accuracy by 9.2%. On DeepSeekMoE, it matched or exceeded LoRA performance while requiring only 23.2% of its parameter budget. The authors also introduced HELLoRI, an extreme variant that freezes the up-projection and sparsifies the down-projection to push parameter budgets even lower. For practitioners running open-weight MoE models locally, this matters: fewer trainable parameters mean lower memory overhead during fine-tuning and faster iteration cycles.
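The parameter savings follow from simple counting: a LoRA adapter on a linear layer adds a down-projection of `rank × d_in` entries and an up-projection of `d_out × rank` entries, so adapting only a few experts per layer shrinks the budget proportionally. The sketch below tallies this under loudly hypothetical dimensions (not taken from the paper), and models the HELLoRI variant only as the article summarizes it: a frozen up-projection plus a down-projection where an assumed fraction `down_density` of entries is trainable.

```python
def adapter_params(shape, rank, frozen_up=False, down_density=1.0):
    """Trainable parameters of one LoRA adapter on a linear layer of shape (d_in, d_out).

    LoRA adds a down-projection A (rank x d_in) and an up-projection B (d_out x rank).
    frozen_up / down_density mimic the HELLoRI idea as summarized in the article
    (hypothetical modeling): the up-projection is frozen and only a fraction of
    down-projection entries is trainable.
    """
    d_in, d_out = shape
    down = round(down_density * rank * d_in)
    up = 0 if frozen_up else rank * d_out
    return down + up


def expert_adapter_budget(num_layers, adapted_experts, expert_shapes, rank, **kwargs):
    """Total trainable parameters when `adapted_experts` experts per layer get adapters."""
    per_expert = sum(adapter_params(s, rank, **kwargs) for s in expert_shapes)
    return num_layers * adapted_experts * per_expert


# Illustrative dimensions (hypothetical, not taken from the paper).
d_model, d_ff = 2048, 1024
shapes = [(d_model, d_ff), (d_model, d_ff), (d_ff, d_model)]  # gate, up, down projections
layers, total_experts, hot_experts, rank = 16, 64, 8, 8

all_experts = expert_adapter_budget(layers, total_experts, shapes, rank)
hot_only = expert_adapter_budget(layers, hot_experts, shapes, rank)
hot_sparse = expert_adapter_budget(layers, hot_experts, shapes, rank,
                                   frozen_up=True, down_density=0.1)

print(all_experts, hot_only, hot_sparse)  # 75497472 9437184 524160
print(f"hot-expert budget: {hot_only / all_experts:.1%} of all-expert LoRA")  # 12.5%
```

Here the 12.5% ratio is simply 8 of 64 experts under these made-up settings; it is not meant to reproduce the reported 15.7% or 23.2% figures, which depend on the paper's actual model dimensions, ranks, and which modules receive adapters.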
