Continued pretraining and instruction tuning on large-scale multilingual data have proven effective for scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining and instruction tuning, as well as an analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.