Diagnosis extraction from unstructured Dutch echocardiogram reports using span- and document-level characteristic classification

Clinical machine learning research and AI-driven clinical decision support models rely on clinically accurate labels. Manually extracting these labels with the help of clinical specialists is often time-consuming and expensive. This study tests the feasibility of automatic span- and document-level diagnosis extraction from unstructured Dutch echocardiogram reports. We included 115,692 unstructured echocardiogram reports from UMCU, a large university hospital in the Netherlands. A randomly selected subset was manually annotated for the occurrence and severity of eleven commonly described cardiac characteristics. We developed and tested several automatic labelling techniques at both span and document levels, using weighted and macro F1-score, precision, and recall for performance evaluation. We compared the performance of span labelling against document labelling methods, which included both direct document classifiers and indirect document classifiers that rely on span classification results. SpanCategorizer was the best-performing span classifier and MedRoBERTa.nl the best-performing document classifier. The weighted F1-score varied between characteristics, ranging from 0.60 to 0.93 for SpanCategorizer and from 0.96 to 0.98 for MedRoBERTa.nl. Direct document classification was superior to indirect document classification using span classifiers. SetFit achieved competitive document classification performance using only 10% of the training data. Using a reduced label set yielded near-perfect document classification results. We recommend using our published SpanCategorizer and MedRoBERTa.nl models for span- and document-level diagnosis extraction from Dutch echocardiography reports. For settings with limited training data, SetFit may be a promising alternative for document classification.
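
Because results are reported as both weighted and macro F1-scores, the difference between the two averages can be illustrated with a minimal sketch. This is not the evaluation code used in the study: the function names, the label names ("normal", "severe"), and the toy predictions are hypothetical and serve only to show how the two averages diverge when classes are imbalanced.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true instances per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical toy data: 8 "normal" reports and 2 "severe" reports,
# with one "severe" report misclassified as "normal".
y_true = ["normal"] * 8 + ["severe"] * 2
y_pred = ["normal"] * 8 + ["normal", "severe"]

print(f"macro F1:    {macro_f1(y_true, y_pred):.3f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.3f}")
```

In this toy case the weighted F1-score is higher than the macro F1-score, because the error falls on the rare class and the weighted average is dominated by the frequent one. This is one reason to report both averages when some label values are uncommon.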
