In this paper, we aim to develop an open-source, multilingual language model for medicine that benefits a wider, linguistically diverse audience across different regions. Our contributions are threefold. First, for multilingual medical-specific adaptation, we construct a new multilingual medical corpus, termed MMedC, that contains approximately 25.5B tokens across 6 main languages and enables auto-regressive training of existing general large language models (LLMs). Second, to monitor the development of multilingual LLMs in medicine, we propose a new multilingual medical multiple-choice question-answering benchmark with rationales, termed MMedBench. Third, we assess a number of popular open-source LLMs on our benchmark, along with those further trained auto-regressively on MMedC. Our final model, termed MMedLM 2, has only 7B parameters yet achieves superior performance compared to all other open-source models, even rivaling GPT-4 on MMedBench. We will make the resources publicly available, including code, model weights, and datasets.