Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are trained primarily on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for successfully adapting an LLM to a less-resourced language, and demonstrate them on Slovene. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and show that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to Slovene using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pre-training tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. The model outperforms the 12B Gemma 3 across all three scenarios and performs comparably to the much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60%.