Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models

In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that models fine-tuned from broad-coverage checkpoints can benefit substantially from additional training rounds on large-scale in-domain resources. However, these resources are often out of reach for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. To reduce this gap, our work investigates two accessible approaches to deriving biomedical language models in languages other than English, taking Italian as a concrete use case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus favoring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but that concatenating high-quality data can improve model performance even with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the lessons learned from the study offer valuable insights towards building biomedical language models that generalize to other less-resourced languages and different domain settings.
