Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts

Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, William Collins, Neera Ahuja, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, John Pauly, Akshay S. Chaudhari

From arXiv, 23 pages, 22 figures

Sifting through vast textual data and summarizing key information imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown immense promise in natural language processing (NLP) tasks, their efficacy across diverse clinical summarization tasks has not yet been rigorously examined. In this work, we apply domain adaptation methods to eight LLMs, spanning six datasets and four distinct summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Our thorough quantitative assessment reveals trade-offs between models and adaptation methods, as well as instances where recent advances in LLMs may not lead to improved results. Further, in a clinical reader study with six physicians, we show that summaries from the best-adapted LLM are preferable to human summaries in terms of completeness and correctness. Our ensuing qualitative analysis delineates challenges shared by LLMs and human experts. Lastly, we correlate traditional quantitative NLP metrics with reader study scores to better understand how these metrics align with physician preferences. Our research provides the first evidence of LLMs outperforming human experts in clinical text summarization across multiple tasks. This implies that integrating LLMs into clinical workflows could alleviate documentation burden, empowering clinicians to focus more on personalized patient care and other irreplaceable human aspects of medicine.
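The abstract does not say which correlation coefficient is used to compare NLP metrics with reader study scores. As a minimal sketch of the idea, the snippet below assumes Spearman rank correlation and computes it in plain Python. The function names and the numbers in `metric_scores` and `reader_scores` are hypothetical placeholders for illustration only; they are not data or results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholder values, one pair per summary (NOT from the paper):
# an automatic metric score and a physician preference score.
metric_scores = [0.21, 0.35, 0.30, 0.52, 0.48]
reader_scores = [-2, 0, 1, 4, 3]

rho = spearman(metric_scores, reader_scores)
print(f"Spearman correlation between metric and reader scores: {rho:.2f}")
```

A rank-based coefficient is a natural assumption here because reader preferences are typically ordinal, but a different coefficient (for example Pearson or Kendall) could be substituted without changing the overall procedure.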
