While large language models (LLMs) are pre-trained on multilingual corpora, their performance in most languages still lags behind that in a few resource-rich languages. One common approach to mitigating this gap is to translate training data from resource-rich languages into other languages and then continue training. However, relying solely on translated data while ignoring the original capabilities of LLMs across languages is not always effective, and we show that it limits cross-lingual knowledge transfer. In this work, we propose SDRRL, a method based on Self-Distillation from Resource-Rich Languages that effectively improves multilingual performance by leveraging the internal capabilities of LLMs in resource-rich languages. We evaluate on different LLMs (LLaMA-2 and SeaLLM) and source languages across various comprehension and generation tasks. Experimental results demonstrate that SDRRL can significantly enhance multilingual capabilities while minimizing the impact on original performance in resource-rich languages.