Biomedical entity linking, a core component of automatic information extraction from health-related texts, plays a pivotal role in connecting textual entities (such as diseases, drugs and body parts mentioned by patients) to their corresponding concepts in a structured biomedical knowledge base. The task remains challenging despite recent developments in natural language processing. This paper presents the first evaluated biomedical entity linking model for the Dutch language. We use MedRoBERTa.nl as the base model and perform second-phase pretraining through self-alignment on a Dutch biomedical ontology extracted from the UMLS and Dutch SNOMED. We derive a corpus of ontology-linked Dutch biomedical entities in context from Wikipedia and fine-tune our model on this dataset. We evaluate our model on the Dutch portion of the Mantra GSC corpus and achieve 54.7% classification accuracy and 69.8% 1-distance accuracy. We then perform a case study on a collection of unlabeled patient-support forum data and show that our model is hampered by the limited quality of the preceding entity recognition step. Manual evaluation of a small sample indicates that, of the correctly extracted entities, around 65% are linked to the correct concept in the ontology. Our results indicate that biomedical entity linking in a language other than English remains challenging, but our Dutch model can be used for high-level analysis of patient-generated text.
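The two evaluation figures reported above can be illustrated with a minimal sketch. It assumes that classification accuracy counts a prediction as correct only when it equals the gold concept, and that 1-distance accuracy additionally accepts a prediction that is a direct neighbour of the gold concept in the ontology graph. The abstract does not define 1-distance accuracy, so this reading is an assumption; the function names, the concept identifiers (`C1` to `C6`) and the `neighbours` mapping are hypothetical and used only for illustration.

```python
from typing import Dict, List, Set


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of mentions linked to exactly the gold concept."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def one_distance_accuracy(
    predicted: List[str],
    gold: List[str],
    neighbours: Dict[str, Set[str]],
) -> float:
    """Fraction of mentions linked to the gold concept or to a concept
    one edge away from it (assumed definition, edges treated as undirected)."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = 0
    for p, g in zip(predicted, gold):
        is_exact = p == g
        # Check both directions so a one-sided adjacency list still works.
        is_adjacent = p in neighbours.get(g, set()) or g in neighbours.get(p, set())
        if is_exact or is_adjacent:
            hits += 1
    return hits / len(gold)


# Toy ontology: hypothetical concept IDs with direct relations.
neighbours = {"C1": {"C2"}, "C3": {"C4"}}
gold = ["C1", "C3", "C5", "C1"]
predicted = ["C1", "C4", "C6", "C2"]

print(accuracy(predicted, gold))
print(one_distance_accuracy(predicted, gold, neighbours))
```

Under this reading, 1-distance accuracy is always at least as high as classification accuracy, which is consistent with the 69.8% and 54.7% figures reported above.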