Evaluating the Performance of Using Large Language Models to Automate Summarization of CT Simulation Orders in Radiation Oncology

Meiyun Cao, Shaw Hu, Jason Sharp, Edward Clouser, Jason Holmes, Linda L. Lam, Xiaoning Ding, Diego Santos Toesca, Wendy S. Lindholm, Samir H. Patel, Sujay A. Vora, Peilong Wang, Wei Liu

Purpose: This study uses a large language model (LLM) to automatically generate summaries of CT simulation orders and evaluates its performance. Materials and Methods: A total of 607 CT simulation orders were collected from the Aria database at our institution. A locally hosted Llama 3.1 405B model, accessed through an application programming interface (API) service, was used to extract keywords from the orders and generate summaries. The orders were categorized into seven groups based on treatment modality and disease site. For each group, a customized instruction prompt was developed in collaboration with therapists to guide the model in generating summaries. Ground-truth summaries were derived manually by carefully reviewing each CT simulation order and were then verified by therapists. Therapists evaluated the accuracy of the LLM-generated summaries using the verified ground truth as a reference. Results: About 98% of the LLM-generated summaries agreed with the manually derived ground truth. The LLM-generated summaries were also more consistent in format and more readable than the corresponding therapist-generated summaries. Performance was consistent across all groups, regardless of modality or disease site. Conclusions: The Llama 3.1 405B model showed high precision and consistency in extracting keywords and summarizing CT simulation orders, suggesting that LLMs have great potential to help with this task, reduce therapists' workload, and improve workflow efficiency.
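The workflow described in the abstract (a group-specific instruction prompt, a model call, and a therapist verdict per summary) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the group names, prompt wording, the `Order` type, and the `generate` callable are all hypothetical, and the model call is left as a pluggable function because the abstract does not describe the API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical group names and prompt texts. The abstract states only that
# there were seven groups defined by treatment modality and disease site.
PROMPTS: Dict[str, str] = {
    "group_a": "Summarize the following CT simulation order (group A).",
    "group_b": "Summarize the following CT simulation order (group B).",
}


@dataclass
class Order:
    order_id: str
    group: str  # key into PROMPTS (assumed to be assigned beforehand)
    text: str   # raw CT simulation order text


def build_prompt(order: Order) -> str:
    """Combine the group-specific instruction with the order text."""
    return f"{PROMPTS[order.group]}\n\nOrder:\n{order.text}"


def summarize_all(orders: List[Order],
                  generate: Callable[[str], str]) -> Dict[str, str]:
    """Run every order through `generate`, a stand-in for the model call."""
    return {o.order_id: generate(build_prompt(o)) for o in orders}


def accuracy(verdicts: Dict[str, bool]) -> float:
    """Fraction of summaries reviewers judged consistent with ground truth."""
    return sum(verdicts.values()) / len(verdicts)
```

In practice `generate` would wrap whatever client reaches the hosted model, and `verdicts` would come from therapist review rather than an automatic comparison.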

