This survey focuses on encoder language models for solving clinical-domain tasks in Spanish. We review the contributions of 17 corpora focused mainly on clinical tasks, then list the most relevant Spanish language models and Spanish clinical language models. We perform a thorough comparison of these models by benchmarking them on a curated subset of the available corpora in order to identify the best-performing ones; in total, more than 3000 models were fine-tuned for this study. All the tested corpora and the best models are made publicly available in an accessible way, so that the results can be reproduced by independent teams or challenged in the future when new Spanish clinical language models are created.