In this paper, we aim to develop an open-source, multilingual language model for medicine that benefits a wider, linguistically diverse audience across different regions. Our contributions are threefold. First, for multilingual medical-specific adaptation, we construct a new multilingual medical corpus, termed MMedC, that contains approximately 25.5B tokens across 6 main languages and enables auto-regressive training of existing general large language models (LLMs). Second, to monitor the development of multilingual LLMs in medicine, we propose a new multilingual medical multiple-choice question-answering benchmark with rationales, termed MMedBench. Third, we assess a number of popular open-source LLMs on our benchmark, along with those further trained auto-regressively on MMedC. Our final model, termed MMedLM 2, has only 7B parameters yet achieves superior performance compared to all other open-source models, even rivaling GPT-4 on MMedBench. We will make the resources publicly available, including code, model weights, and datasets.