Large language models (LLMs) show promise in drafting responses to patient portal messages, yet their integration into clinical workflows raises concerns, including whether they would actually save clinicians time and effort in handling their portal workload. We investigate LLM alignment with individual clinicians through a comprehensive evaluation of the patient message response drafting task. We develop a novel taxonomy of thematic elements in clinician responses and propose an evaluation framework for assessing the clinician editing load of LLM-drafted responses at both the content and theme levels. We release an expert-annotated dataset and conduct large-scale evaluations of local and commercial LLMs using several adaptation techniques: thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization. Our results reveal substantial epistemic uncertainty in aligning LLM drafts with clinician responses. While LLMs can draft certain thematic elements well, they struggle to produce clinician-aligned text for other themes, particularly asking questions to elicit further information from patients. Theme-driven adaptation strategies yield improvements across most themes. Our findings underscore the necessity of adapting LLMs to individual clinician preferences to enable reliable and responsible use in patient-clinician communication workflows.