Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

Alejandro Lozano, Keiko Ihara, Ping-Hao Yang, Carrie E. Robertson, Jennifer Stern, Allan Purdy, Hsiangkuo Yuan, Pengfei Zhang, Yulia Orlova, Olga Fermo, Jennifer Hranilovich, Fred Cohen, Todd J. Schwedt, Jenelle A. Jindal, Serena Yeung-Levy, Chia-Chun Chiang

Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature, and direct comparisons to expert-written syntheses, remain scarce. We constructed an agentic AI framework based on retrieval-augmented generation (RAG) using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question; together with the three LLM-generated summaries, this yielded four summaries per question (expert, Sonnet, GPT-4o, and Llama). Blinded to authorship, the experts evaluated the summaries for every topic except the one they had written on, scoring correctness, completeness, conciseness, and clinical utility from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. In this comparison of LLM- and expert-written literature summaries evaluated by headache specialists, expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.
