Diagnostic codes for Barrett's esophagus (BE), a precursor to esophageal cancer, lack the granularity and precision needed for many research and clinical use cases. As a result, extracting key diagnostic phenotypes from BE pathology reports requires laborious manual chart review. We developed a generalizable transformer-based method to automate this extraction. Using pathology reports from Columbia University Irving Medical Center with gastroenterologist-annotated targets, we performed binary dysplasia classification as well as fine-grained multi-class classification of BE-related diagnoses. We used two clinically pre-trained large language models; the best model performed comparably to a highly tailored rule-based system developed on the same data. Binary dysplasia extraction achieved an F1-score of 0.964, and the multi-class model achieved an F1-score of 0.911. Our method is generalizable and faster to implement than a tailored rule-based approach.
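For readers less familiar with the reported metric, the following is a minimal sketch of how an F1-score can be computed for a binary task and averaged across classes for a multi-class task. The abstract does not state which averaging scheme was used for the multi-class score, so the macro average shown here is an assumption, and the label names and toy predictions are hypothetical rather than taken from the study data.

```python
def f1_for_label(y_true, y_pred, label):
    """F1-score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never appears.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (an assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical toy labels, for illustration only.
y_true = ["dysplasia", "no_dysplasia", "dysplasia", "no_dysplasia", "dysplasia"]
y_pred = ["dysplasia", "no_dysplasia", "no_dysplasia", "no_dysplasia", "dysplasia"]

binary_f1 = f1_for_label(y_true, y_pred, "dysplasia")
multi_f1 = macro_f1(y_true, y_pred)
print(binary_f1, multi_f1)
```

Macro averaging weights every class equally, which can matter when some diagnoses are rare; a weighted or micro average would give different numbers on imbalanced data.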