Language technology for medical applications is currently an active research topic in Natural Language Understanding and Generation. Accordingly, a number of large language models (LLMs) have recently been adapted to the medical domain so that they can serve as tools for mediating human-AI interaction. While these LLMs display competitive performance on automated medical text benchmarks, they have been pre-trained and evaluated with a focus on a single language (mostly English). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data that are often not easily accessible for many languages. In this paper, we address these shortcomings by compiling what is, to the best of our knowledge, the largest multilingual corpus for the medical domain, covering four languages: English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models on the Spanish, French, and Italian benchmarks, while being competitive with current state-of-the-art LLMs in English.