Ontology-Driven and Weakly Supervised Rare Disease Identification from Clinical Notes

From arXiv. Accepted for BMC Medical Informatics and Decision Making; structured abstract in full text; 16 pages, 4 figures (plus 7 extra pages and 1 figure in the supplementary material).

Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to identify because few cases are available for machine learning and because data annotation requires domain experts. We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bidirectional Transformers (e.g. BERT). The ontology-based framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in the Unified Medical Language System (UMLS), with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representation; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in the Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach learns a phenotype confirmation model that improves Text-to-UMLS linking without annotated data from domain experts. We evaluated the approach on three annotated clinical datasets from two institutions in the US and the UK: MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports. Precision improved markedly (by over 30% to 50% in absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with those on the discharge summaries. The overall pipeline for processing clinical notes can extract rare disease cases, most of which are not captured in structured data (manually assigned ICD codes). We discuss the usefulness of the weak supervision approach and propose directions for future studies.
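The two-step framework described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Mention` type, the concept and ontology identifiers (placeholders, not real UMLS or ORDO IDs), and the single mention-length rule are hypothetical and are not taken from the paper or from SemEHR. In the approach described above, rule-derived weak labels would train a phenotype confirmation model over contextual representations; in this sketch the rule itself stands in for that model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Mention:
    """A candidate mention as an NER+L tool might return it (hypothetical shape)."""
    text: str     # surface form found in the note
    cui: str      # placeholder concept ID proposed by the linker
    context: str  # sentence surrounding the mention


# Hypothetical concept-to-rare-disease mapping; the IDs are placeholders only.
UMLS_TO_ORDO: Dict[str, str] = {"CUI_DEMO_1": "ORDO_DEMO_A"}


def weak_label(mention: Mention, min_length: int = 4) -> bool:
    """Illustrative weak-labelling rule (an assumption, not the paper's rules):
    very short mentions are often ambiguous abbreviations, so reject them."""
    return len(mention.text) >= min_length


def identify_rare_diseases(mentions: List[Mention]) -> List[str]:
    """Step (i): keep mentions confirmed by the stand-in rule.
    Step (ii): map the surviving concept IDs to rare disease IDs."""
    confirmed = [m for m in mentions if weak_label(m)]
    matched = {UMLS_TO_ORDO[m.cui] for m in confirmed if m.cui in UMLS_TO_ORDO}
    return sorted(matched)


mentions = [
    Mention("retinitis pigmentosa", "CUI_DEMO_1", "History of retinitis pigmentosa."),
    Mention("HD", "CUI_DEMO_1", "Patient on HD three times weekly."),
    Mention("hypertension", "CUI_DEMO_3", "Known hypertension."),
]
result = identify_rare_diseases(mentions)
```

In this sketch the short mention is rejected by the rule, and the common condition is dropped because its concept has no entry in the placeholder mapping. A real system would replace `weak_label` at inference time with a classifier trained on the weakly labelled mentions.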
