Open source large language models (LLMs) have improved markedly in recent years, but many of these models focus solely on a handful of widely spoken languages. We present a high-quality dataset of more than 70k prompt-response pairs in 74 languages, consisting of human-generated prompts and synthetic responses. We use this dataset to train a state-of-the-art open source English LLM to chat multilingually. We evaluate our model on MT-Bench chat benchmarks in 6 languages, finding that it outperforms previous state-of-the-art open source LLMs in each language. We further find that training on more multilingual data improves performance in a chosen target language (Japanese) compared with training only on data in that language. These results indicate the necessity of training on large amounts of high-quality multilingual data to make LLMs more accessible.
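To make the training recipe concrete, the following is a minimal sketch, assuming the Hugging Face transformers API, of one supervised fine-tuning step on a single multilingual prompt-response pair. The base model identifier and the Japanese example record are illustrative placeholders, not the paper's actual data or model.

```python
# A minimal sketch of one supervised fine-tuning step on a multilingual
# prompt-response pair, assuming the Hugging Face transformers API.
# Model name and example record are placeholders, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder base chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One hypothetical record: a human-written prompt in the source language
# paired with a synthetic assistant response.
record = {
    "language": "Japanese",
    "messages": [
        {"role": "user", "content": "日本で一番高い山は何ですか？"},
        {"role": "assistant", "content": "日本で一番高い山は富士山で、標高は3,776メートルです。"},
    ],
}

# Render the conversation with the model's chat template and train with
# the standard next-token cross-entropy objective (labels = input ids).
# For simplicity the loss here also covers the prompt tokens; practical
# setups often mask them so only the response contributes to the loss.
input_ids = tokenizer.apply_chat_template(record["messages"], return_tensors="pt")
outputs = model(input_ids=input_ids, labels=input_ids)
outputs.loss.backward()  # one gradient step of causal-LM fine-tuning
```

In practice this step would be wrapped in a training loop over the full multilingual dataset with an optimizer; the sketch only shows how a single prompt-response pair turns into a supervised training signal.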