Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain

Iker García-Ferrero, Rodrigo Agerri, Aitziber Atutxa Salazar, Elena Cabrio, Iker de la Iglesia, Alberto Lavelli, Bernardo Magnini, Benjamin Molinet, Johana Ramirez-Romero, German Rigau, Jose Maria Villa-Gonzalez, Serena Villata, Andrea Zaninello

From arXiv; LREC-COLING 2024

Research on language technology for medical applications is currently a hot topic in Natural Language Understanding and Generation. Accordingly, a number of large language models (LLMs) have recently been adapted to the medical domain so that they can serve as tools for mediating human-AI interaction. While these LLMs display competitive performance on automated medical text benchmarks, they have been pre-trained and evaluated with a focus on a single language (mostly English). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data that are often not easily accessible for many languages. In this paper, we address these shortcomings by compiling what is, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages, namely English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models on the Spanish, French, and Italian benchmarks, while being competitive with current state-of-the-art LLMs in English.
