Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. While many effective IFT datasets have been introduced recently, they predominantly focus on high-resource languages such as English. To better align LLMs across a broad spectrum of languages and tasks, we propose M2Lingual, a fully synthetic multilingual, multi-turn instruction finetuning dataset guided by a novel taxonomy (Evol). It is constructed by first selecting a diverse set of seed examples and then applying the proposed Evol taxonomy to convert these seeds into complex and challenging multi-turn instructions. We demonstrate the effectiveness of M2Lingual by training LLMs of varying sizes and showing enhanced performance across a diverse set of languages. We contribute the two-step Evol taxonomy with the guided generation code (https://github.com/ServiceNow/M2Lingual), as well as M2Lingual, the first fully synthetic, general and task-oriented, multi-turn, multilingual dataset built with Evol (https://huggingface.co/datasets/ServiceNow-AI/M2Lingual), containing 182K total IFT pairs and covering 70 languages and 17+ NLP tasks.