Large language models often underperform in many European languages because English and a few other high-resource languages dominate their training data. This paper presents TildeOpen LLM, a 30-billion-parameter open-weight foundation model trained on 34 European languages to promote linguistic equity and improve performance for low-resource languages. To address the data imbalance, we combine dataset upsampling with a curriculum-based training schedule that alternates between uniform and natural language distributions. The resulting model compares favorably with other multilingual LLMs despite being trained with significantly less compute. Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages. Human evaluations confirm up to a tenfold reduction in linguistic errors relative to leading baselines. The model and associated resources are fully open-weight and publicly available at huggingface.co/TildeAI/TildeOpen-30b. These results demonstrate that careful data curation and balanced training strategies can substantially improve multilingual model quality without increasing model size or training volume.
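The abstract does not specify how the alternating schedule is implemented; the following is a minimal Python sketch of one plausible reading, in which per-language corpus sizes, the phase structure, and all function names are illustrative assumptions rather than the paper's actual pipeline. Uniform phases upsample low-resource languages (their data is effectively repeated so they are drawn as often as high-resource ones), while natural phases restore the raw corpus proportions.

```python
import random
from collections import Counter

# Hypothetical per-language corpus sizes in tokens (illustrative numbers only).
corpus_tokens = {"en": 1_000_000_000, "de": 400_000_000,
                 "lv": 20_000_000, "et": 15_000_000}

def natural_dist(sizes):
    """Probability of each language proportional to its corpus size."""
    total = sum(sizes.values())
    return {lang: n / total for lang, n in sizes.items()}

def uniform_dist(sizes):
    """Equal probability for every language, regardless of corpus size."""
    return {lang: 1 / len(sizes) for lang in sizes}

def sample_language(dist, rng):
    """Draw one language according to the given distribution."""
    langs, weights = zip(*dist.items())
    return rng.choices(langs, weights=weights, k=1)[0]

def curriculum_schedule(phases, steps_per_phase, sizes, seed=0):
    """Alternate between uniform and natural language distributions,
    yielding the language to sample a batch from at each step."""
    rng = random.Random(seed)
    for phase in range(phases):
        dist = uniform_dist(sizes) if phase % 2 == 0 else natural_dist(sizes)
        for _ in range(steps_per_phase):
            yield sample_language(dist, rng)

# Example: count how often each language is drawn over 4 alternating phases.
print(Counter(curriculum_schedule(4, 10_000, corpus_tokens)))
```

One plausible rationale, consistent with the abstract's framing, is that uniform phases give low-resource languages enough gradient signal to close the quality gap, while natural phases preserve fluency in the high-resource languages that dominate the raw data.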