Large language models often underperform in many European languages because English and a few other high-resource languages dominate their training data. This paper presents TildeOpen LLM, a 30-billion-parameter open-weight foundation model trained on 34 European languages to promote linguistic equity and improve performance for low-resource languages. To address the data imbalance, we combine dataset upsampling with a curriculum-based training schedule that alternates between uniform and natural language distributions. The resulting model compares favorably with other multilingual LLMs despite being trained with significantly less compute. Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages. Human evaluations confirm up to a tenfold reduction in linguistic errors relative to leading baselines. The model and associated resources are fully open-weight and publicly available at huggingface.co/TildeAI/TildeOpen-30b. These results demonstrate that careful data curation and balanced training strategies can substantially improve multilingual model quality without increasing model size or training volume.
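The abstract does not specify how the alternating schedule is implemented; the following is a minimal Python sketch of one plausible reading, in which per-language corpus sizes, the phase structure, and all function names are illustrative assumptions rather than the paper's actual pipeline. Uniform phases upsample low-resource languages (their data is effectively repeated so they are drawn as often as high-resource ones), while natural phases restore the raw corpus proportions.

```python
import random
from collections import Counter

# Hypothetical per-language corpus sizes in tokens (illustrative numbers only).
corpus_tokens = {"en": 1_000_000_000, "de": 400_000_000,
                 "lv": 20_000_000, "et": 15_000_000}

def natural_dist(sizes):
    """Probability of each language proportional to its corpus size."""
    total = sum(sizes.values())
    return {lang: n / total for lang, n in sizes.items()}

def uniform_dist(sizes):
    """Equal probability for every language, regardless of corpus size."""
    return {lang: 1 / len(sizes) for lang in sizes}

def sample_language(dist, rng):
    """Draw one language according to the given distribution."""
    langs, weights = zip(*dist.items())
    return rng.choices(langs, weights=weights, k=1)[0]

def curriculum_schedule(phases, steps_per_phase, sizes, seed=0):
    """Alternate between uniform and natural language distributions,
    yielding the language to sample a batch from at each step."""
    rng = random.Random(seed)
    for phase in range(phases):
        dist = uniform_dist(sizes) if phase % 2 == 0 else natural_dist(sizes)
        for _ in range(steps_per_phase):
            yield sample_language(dist, rng)

# Example: count how often each language is drawn over 4 alternating phases.
print(Counter(curriculum_schedule(4, 10_000, corpus_tokens)))
```

One plausible rationale, consistent with the abstract's framing, is that uniform phases give low-resource languages enough gradient signal to close the quality gap, while natural phases preserve fluency in the high-resource languages that dominate the raw data.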