General-Purpose vs. Domain-Adapted Large Language Models for Extraction of Data from Thoracic Radiology Reports

Radiologists produce unstructured data that could be valuable for clinical care when consumed by information systems. However, variability in reporting style limits its use. This study compares the performance of a system using a domain-adapted language model (RadLing) with that of a system using a general-purpose large language model (GPT-4) in extracting common data elements (CDEs) from thoracic radiology reports. Three radiologists annotated a retrospective dataset of 1300 thoracic reports (900 training, 400 test) and mapped them to 21 pre-selected relevant CDEs. The RadLing system generated sentence embeddings with RadLing and identified CDEs using cosine similarity; the identified CDEs were then mapped to values using a lightweight mapper. The GPT-4 system used OpenAI's general-purpose embeddings to identify relevant CDEs and used GPT-4 to map them to values. The output CDE:value pairs were compared to the reference standard; only an identical match was counted as a true positive. Precision (positive predictive value) was 96% (2700/2824) for RadLing and 99% (2034/2047) for GPT-4. Recall (sensitivity) was 94% (2700/2876) for RadLing and 70% (2034/2887) for GPT-4; the difference was statistically significant (P<.001). RadLing's domain-adapted embeddings were more sensitive in CDE identification (95% vs 71%), and its lightweight mapper had comparable precision in value assignment (95.4% vs 95.0%). The RadLing system exhibited higher performance than the GPT-4 system in extracting CDEs from radiology reports. Its domain-adapted embeddings outperform OpenAI's general-purpose embeddings in CDE identification, and its lightweight value mapper achieves precision comparable to that of the much larger GPT-4. The RadLing system also offers operational advantages, including local deployment and reduced runtime costs. In summary, the domain-adapted RadLing system surpasses the GPT-4 system in extracting common data elements from radiology reports while providing the benefits of local deployment and lower costs.
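
To make the pipeline and the reported metrics concrete, the following is a minimal sketch, not the authors' implementation. It shows (1) CDE identification by cosine similarity between a sentence embedding and per-CDE reference embeddings, and (2) precision and recall recomputed from the counts quoted in the abstract. The function names, the 0.8 threshold, the CDE labels, and the 3-dimensional toy vectors are all assumptions chosen for illustration; the abstract does not specify how RadLing thresholds or ranks similarities.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for a zero vector (assumption)
    return dot / (norm_a * norm_b)


def identify_cde(sentence_vec, cde_vecs, threshold=0.8):
    """Return the best-matching CDE name, or None if nothing reaches the threshold.

    The threshold value is hypothetical; it is not taken from the source.
    """
    best_name, best_score = None, threshold
    for name, vec in cde_vecs.items():
        score = cosine_similarity(sentence_vec, vec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def precision(true_pos, predicted_total):
    """Positive predictive value: true positives / all predicted pairs."""
    return true_pos / predicted_total


def recall(true_pos, reference_total):
    """Sensitivity: true positives / all pairs in the reference standard."""
    return true_pos / reference_total


# Toy embeddings (hypothetical); real sentence embeddings have hundreds of dimensions.
cde_vecs = {
    "pleural_effusion": [0.9, 0.1, 0.0],
    "pneumothorax": [0.0, 0.2, 0.9],
}
sentence_vec = [0.8, 0.2, 0.1]
matched = identify_cde(sentence_vec, cde_vecs)

# Counts as reported in the abstract: (true positives, predicted, reference).
reported = {
    "RadLing": (2700, 2824, 2876),
    "GPT-4": (2034, 2047, 2887),
}
metrics = {
    system: (round(100 * precision(tp, pred)), round(100 * recall(tp, ref)))
    for system, (tp, pred, ref) in reported.items()
}

print(matched)
print(metrics)
```

Rounded to whole percentages, the recomputed values agree with the abstract: 96% precision and 94% recall for RadLing, and 99% precision and 70% recall for GPT-4. The reference-standard denominators (2876 and 2887) are used exactly as reported.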
