Large Language Models (LLMs) are increasingly demonstrating the potential to reach human-level performance in generating clinical summaries from patient-clinician conversations. However, these summaries often focus on patients' biology rather than their preferences, values, wishes, and concerns. To support patient-centered care, we propose a new standard for Artificial Intelligence (AI) clinical summarization tasks: Patient-Centered Summaries (PCS). Our objectives were to develop a framework for generating PCS that capture patient values and ensure clinical utility, and to assess whether current open-source LLMs can achieve human-level performance in this task. We used a mixed-methods process. Two Patient and Public Involvement groups (10 patients and 8 clinicians) in the United Kingdom participated in semi-structured interviews exploring what personal and contextual information should be included in clinical summaries and how it should be structured for clinical use. The findings informed annotation guidelines, which eight clinicians used to create gold-standard PCS from 88 atrial fibrillation consultations. Sixteen consultations were used to refine a prompt aligned with the guidelines. Five open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and Qwen3-8B) generated summaries for 72 consultations using zero-shot and few-shot prompting; the summaries were evaluated with ROUGE-L, BERTScore, and qualitative metrics. Patients emphasized lifestyle routines, social support, recent stressors, and care values. Clinicians sought concise functional, psychosocial, and emotional context. The best zero-shot performance was achieved by Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot performance was achieved by Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Completeness and fluency were similar between experts and models, whereas correctness and patient-centeredness favored the human-written PCS.
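For readers unfamiliar with ROUGE-L, one of the automatic metrics reported above, the following is a minimal sketch of how such a score can be computed from the longest common subsequence (LCS) of tokens shared by a reference summary and a generated summary. It assumes lowercased whitespace tokenization and the balanced F1 form of the score; the abstract does not state which ROUGE implementation or settings the authors used, so this should not be read as their evaluation code. The function names and the example sentences are hypothetical and are not drawn from the study data.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings.

    Assumption: lowercased whitespace tokenization and equal weighting of
    precision and recall; published implementations may differ.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference (clinician-written) and candidate (model-written) lines.
reference = "patient values daily walks with her husband"
candidate = "patient values walks with husband"
score = rouge_l_f1(reference, candidate)
print(round(score, 3))
```

Because ROUGE-L rewards only in-order token overlap, a summary that paraphrases a patient's values faithfully can still score low, which is one reason embedding-based metrics such as BERTScore and qualitative ratings are reported alongside it.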