Symptom information is primarily documented in free-text clinical notes and is not directly accessible to downstream applications. Making it accessible requires information extraction approaches that can handle variation in clinical language across institutions and specialties. In this paper, we study domain generalization for symptom extraction, using pretraining and fine-tuning data that differ from the target domain in terms of institution and/or specialty and patient population. We extract symptom events using a transformer-based joint entity and relation extraction method. To reduce reliance on domain-specific features, we propose a domain generalization method that dynamically masks frequent symptom words in the source domain. Additionally, we pretrain the transformer language model (LM) on task-related unlabeled text to obtain better representations. Our experiments indicate that masking and adaptive pretraining can significantly improve performance when the source domain is more distant from the target domain.
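The idea of dynamically masking frequent source-domain symptom words can be illustrated with a minimal sketch. The abstract does not specify how frequency is measured, how many words are masked, or with what probability, so the whitespace tokenization, the `top_k` cutoff, the `mask_prob` parameter, the `[MASK]` token, and both function names below are assumptions made purely for illustration, not the authors' implementation.

```python
import random
from collections import Counter

MASK_TOKEN = "[MASK]"  # assumption: a BERT-style mask token


def frequent_symptom_words(symptom_spans, top_k=2):
    """Return the top_k most frequent words across source-domain symptom spans.

    Hypothetical helper: the cutoff and whitespace tokenization are assumptions.
    """
    counts = Counter(
        word.lower() for span in symptom_spans for word in span.split()
    )
    return {word for word, _ in counts.most_common(top_k)}


def dynamic_mask(tokens, frequent_words, mask_prob=0.5, rng=random):
    """Replace each frequent symptom word with MASK_TOKEN with probability mask_prob.

    Intended to be called anew on every training pass, so the masked
    positions can differ from epoch to epoch (the "dynamic" part).
    """
    return [
        MASK_TOKEN
        if tok.lower() in frequent_words and rng.random() < mask_prob
        else tok
        for tok in tokens
    ]


# Toy source-domain symptom annotations (invented for this example).
source_spans = ["cough", "dry cough", "chest pain", "cough", "pain"]
frequent = frequent_symptom_words(source_spans, top_k=2)

tokens = "Patient reports dry cough and chest pain".split()
always = dynamic_mask(tokens, frequent, mask_prob=1.0)  # mask every frequent word
never = dynamic_mask(tokens, frequent, mask_prob=0.0)   # mask nothing
```

In a training loop, `dynamic_mask` would be applied to each source-domain sentence before it is passed to the extraction model, which is one plausible way to push the model toward contextual cues rather than memorized symptom vocabulary.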