Sardinian model reaches 28.5 BLEU on single consumer GPU with rsLoRA fine-tuning
LLiMba, a 3B-parameter Sardinian language model, is built by adapting Qwen2.5-3B-Instruct on a single 24 GB GPU and reaches state-of-the-art translation performance through rsLoRA r256 fine-tuning on 11.5 million Sardinian tokens.

Adapter rank matters far more than the choice of fine-tuning method when bringing a vanishing language to life on consumer hardware, according to a preprint on LLiMba, a 3-billion-parameter Sardinian model that runs on a single 24 GB GPU.
Author Luca Ballore adapted Qwen2.5-3B-Instruct through continued pretraining and supervised fine-tuning on a corpus of 11.5 million Sardinian tokens spanning three regional variants (LSC, Logudorese, and Campidanese), plus 2.4 million tokens of related Romance text to prevent register drift. After continued pretraining alone, the model reached a perplexity of 6.76 on held-out Sardinian and outperformed the base model across all six FLORES-200 translation directions. Sardinian has roughly one million speakers and no presence in commercial NLP services; current language models do not produce it reliably.
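For context, held-out perplexity of this kind is straightforward to measure. The sketch below is a minimal illustration, not the paper's code: the base checkpoint name is real, but the evaluation corpus, chunk length, and loop structure are assumptions (swap in the adapted checkpoint to score the continued-pretraining result).

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal held-out perplexity sketch. The checkpoint is the real base
# model; the corpus and chunking below are illustrative assumptions.
name = "Qwen/Qwen2.5-3B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

def heldout_perplexity(texts, max_len=2048):
    total_nll, total_preds = 0.0, 0
    for text in texts:
        enc = tok(text, return_tensors="pt",
                  truncation=True, max_length=max_len).to(model.device)
        with torch.no_grad():
            # With labels=input_ids, HF shifts internally and returns the
            # mean cross-entropy over the seq_len - 1 predicted positions.
            loss = model(**enc, labels=enc["input_ids"]).loss
        n_preds = enc["input_ids"].size(1) - 1
        total_nll += loss.item() * n_preds
        total_preds += n_preds
    return math.exp(total_nll / total_preds)
```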
The paper tested five fine-tuning configurations under identical conditions: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA r256 won on every translation direction into Sardinian, reaching 28.5 BLEU from English, compared with 17.3 after continued pretraining alone and 21.0 with full fine-tuning. The rank-128 variant placed between LoRA r64 and rsLoRA r256 on BLEU but revealed failure modes invisible to the metric, including script-leakage errors no other configuration produced. LoRA r64 retained less factual content and generated more confident fabrications, though all methods hallucinated on content absent from training. DoRA r256 showed the smallest gap between training and evaluation loss but the worst factual accuracy.
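In PEFT terms, the four adapter variants differ in only a few LoraConfig fields. The sketch below shows one plausible way to express them; the target modules, dropout, and alpha values are assumptions for illustration, not settings taken from the paper.

```python
from peft import LoraConfig

# Plausible configs for the four adapter variants; target modules,
# dropout, and alpha values are assumptions, not the paper's settings.
common = dict(
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

lora_r64    = LoraConfig(r=64,  lora_alpha=128, **common)
rslora_r128 = LoraConfig(r=128, lora_alpha=128, use_rslora=True, **common)
rslora_r256 = LoraConfig(r=256, lora_alpha=256, use_rslora=True, **common)
dora_r256   = LoraConfig(r=256, lora_alpha=256, use_dora=True, **common)
# The fifth configuration, full fine-tuning, uses no adapter at all:
# every base weight is trainable.
```

The mechanical difference is the scaling factor: plain LoRA scales the adapter update by alpha/r, while rsLoRA uses alpha/sqrt(r), so the update does not shrink toward zero as rank grows; DoRA additionally decomposes each adapted weight into magnitude and direction components.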
The findings suggest that adapter capacity is the primary lever for adapting a Romance-pretrained base to a low-resource Romance target, that stronger regularization is not uniformly beneficial, and that translation metrics can smoothly rank configurations whose qualitative behavior differs sharply. The paper also warns that perplexity comparisons across scripts must account for byte-fallback tokenization, which artificially deflates the metric for non-Latin scripts.
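The mechanics of that warning are simple: perplexity is exp(total NLL / number of units), so a tokenizer that falls back to UTF-8 bytes multiplies the unit count for non-Latin text and pushes the per-token figure down. The toy sketch below uses invented numbers purely to show the size of the effect.

```python
import math

def ppl(total_nll_nats, n_units):
    """Perplexity = exp(mean negative log-likelihood per unit)."""
    return math.exp(total_nll_nats / n_units)

# Invented numbers for illustration: two models assign the same total
# NLL to the same passage but segment it very differently.
total_nll = 2000.0       # summed NLL in nats (hypothetical)
n_subword_tokens = 800   # tokenizer keeps words mostly whole
n_byte_tokens = 2400     # byte fallback: several byte tokens per character

print(ppl(total_nll, n_subword_tokens))  # ~12.2
print(ppl(total_nll, n_byte_tokens))     # ~2.3, "better" for free

# Normalizing by a tokenizer-independent unit, e.g. UTF-8 bytes of the
# raw text, makes cross-script perplexities commensurable.
```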