Multilingual proficiency remains a significant challenge for large language models (LLMs). English-centric models usually underperform in other languages, particularly those that are linguistically distant from English. This performance gap stems mainly from the imbalanced distribution of training data across languages during the pre-training and instruction-tuning stages. To address this problem, we propose CrossIn, a novel approach that uses a mixed composition of cross-lingual instruction tuning data. Our method leverages the compressed representation shared across languages to efficiently enhance the model's task-solving capabilities and multilingual proficiency within a single process. In addition, we introduce a multi-task, multi-faceted benchmark to evaluate the effectiveness of CrossIn. Experimental results demonstrate that our method substantially improves performance across tasks and languages, and we provide extensive insights into how cross-lingual data volume and the integration of translation data affect multilingual consistency and accuracy.