Despite the remarkable success of LLMs in English, a significant performance gap remains in non-English languages. To address this, we introduce a novel recipe for creating a multilingual synthetic instruction-tuning dataset, sPhinX, which is built by selectively translating instruction-response pairs from English into 50 languages. We test the effectiveness of sPhinX by using it to fine-tune two state-of-the-art models, Phi-3-small and Mistral-7B, and then evaluating them on a comprehensive suite of multilingual benchmarks covering reasoning, question answering, and reading comprehension. Our results show that Phi-3-small and Mistral-7B fine-tuned with sPhinX outperform the baselines by an average of 4.2%pt and 5%pt, respectively. We also devise a strategy for incorporating N-shot examples into each fine-tuning sample, which further boosts the performance of these models by 3%pt and 10%pt, respectively. Moreover, sPhinX outperforms other multilingual instruction-tuning datasets on the same benchmarks while being sample-efficient and diverse, thereby reducing dataset-creation costs. Finally, instruction tuning with sPhinX does not cause regressions on most standard LLM benchmarks.