Radiologists produce unstructured data that could be valuable for clinical care when consumed by information systems. However, variability in reporting style limits its use. This study compares the performance of a system using a domain-adapted language model (RadLing) with that of a system using a general-purpose large language model (GPT-4) in extracting common data elements (CDEs) from thoracic radiology reports. Three radiologists annotated a retrospective dataset of 1300 thoracic reports (900 training, 400 test) and mapped the findings to 21 preselected relevant CDEs. The RadLing system generated sentence embeddings and identified CDEs using cosine similarity; the identified CDEs were then mapped to values using a lightweight mapper. The GPT-4 system used OpenAI's general-purpose embeddings to identify relevant CDEs and used GPT-4 to map them to values. The output CDE:value pairs were compared with the reference standard; only an identical match was counted as a true positive. Precision (positive predictive value) was 96% (2700/2824) for RadLing and 99% (2034/2047) for GPT-4. Recall (sensitivity) was 94% (2700/2876) for RadLing and 70% (2034/2887) for GPT-4; the difference was statistically significant (P<.001). RadLing's domain-adapted embeddings were more sensitive in CDE identification (95% vs 71%), and its lightweight mapper had comparable precision in value assignment (95.4% vs 95.0%). The RadLing system exhibited higher performance than the GPT-4 system in extracting CDEs from radiology reports. Its domain-adapted embeddings outperformed OpenAI's general-purpose embeddings in CDE identification, and its lightweight value mapper achieved precision comparable to that of the much larger GPT-4. The RadLing system also offers operational advantages, including local deployment and reduced runtime costs. In summary, the domain-adapted RadLing system surpasses the GPT-4 system in extracting common data elements from radiology reports while providing the benefits of local deployment and lower costs.
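The identification step described above, matching sentence embeddings to CDEs by cosine similarity, can be illustrated with a minimal sketch. It is not the RadLing implementation: the three-dimensional vectors, the CDE names `pleural_effusion` and `pneumothorax`, the function `identify_cdes`, and the 0.8 threshold are all hypothetical, chosen only to show the mechanics.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for a zero vector: no similarity
    return dot / (norm_u * norm_v)


def identify_cdes(sentence_vec, cde_vecs, threshold=0.8):
    """Return (cde_name, score) pairs at or above the threshold, best first.

    The threshold value is an assumption for illustration only.
    """
    scored = [
        (name, cosine_similarity(sentence_vec, vec))
        for name, vec in cde_vecs.items()
    ]
    matches = [(name, score) for name, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical; a real model would produce these vectors).
cde_vecs = {
    "pleural_effusion": [0.9, 0.1, 0.0],
    "pneumothorax": [0.0, 0.2, 0.9],
}
sentence_vec = [0.8, 0.2, 0.1]

matches = identify_cdes(sentence_vec, cde_vecs)
for name, score in matches:
    print(f"{name}: {score:.3f}")
```

In a pipeline of this shape, each matched CDE would then be handed to a separate value-assignment step, which the abstract describes as a lightweight mapper for RadLing and as GPT-4 for the comparison system.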
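The reported precision and recall follow directly from the counts given in parentheses. The short sketch below recomputes them as a check; the counts are taken from the text, while the dictionary layout and the key names `predicted` (pairs output by the system) and `reference` (pairs in the reference standard) are illustrative assumptions about what the denominators represent.

```python
# Counts as reported in the abstract: true positives, number of CDE:value
# pairs output by each system, and number of pairs in the reference standard.
counts = {
    "RadLing": {"tp": 2700, "predicted": 2824, "reference": 2876},
    "GPT-4": {"tp": 2034, "predicted": 2047, "reference": 2887},
}

results = {}
for system, c in counts.items():
    precision = c["tp"] / c["predicted"]  # positive predictive value
    recall = c["tp"] / c["reference"]     # sensitivity
    results[system] = {"precision": precision, "recall": recall}
    print(f"{system}: precision {precision:.1%}, recall {recall:.1%}")
```

Rounded to whole percentages, these values agree with the figures quoted in the abstract.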