EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

Jiyoun Kim, Muhan Yeo, Eunhye Jang, Jeewon Yang, Hangyul Yoon, Su Ji Lee, Hee Jo Han, Hee-Jae Jung, Doyun Kwon, Jun young Lee, Jaehun Lee, Jung-Oh Lee, Sunjun Kweon, Jong Hak Moon, Daseul Kim, Minjae Cho, Edward Choi

Discharge summaries are crucial clinical documents that capture the context of a patient's entire hospital stay, and medical experts routinely review them for patient readmission, ongoing care, and diagnostic decision-making. During review, experts must often synthesize information iteratively across multiple summaries while verifying the evidence that supports each answer. Although large language models (LLMs) are increasingly explored for clinical question answering, existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn question answering with limited evaluation of evidence grounding. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over multiple discharge summaries per patient. Built from de-identified MIMIC-IV discharge summaries, EHRNote-ChatQA contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. The benchmark is constructed through an expert-informed pipeline that combines a discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges: LLMs struggle more with evidence grounding than with content answering, errors compound across turns in multi-turn dialogue, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.
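
To make the sample structure described in the abstract concrete, the following is a minimal sketch of how one patient-level sample might be represented. The class and field names (`QAPair`, `Turn`, `PatientSample`, `patient_id`, and so on) are hypothetical and do not reflect the released dataset's actual schema; only the constraints stated in the abstract are encoded (one to five notes per sample, and each content question paired with an evidence-grounding question).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAPair:
    """A single question with its reference answer (hypothetical layout)."""
    question: str
    answer: str


@dataclass
class Turn:
    """One dialogue turn: a content question and its paired evidence question."""
    category: str      # one of the eight clinical categories (names not given in the abstract)
    content: QAPair    # asks about the clinical content of the notes
    evidence: QAPair   # asks for the evidence that supports the content answer


@dataclass
class PatientSample:
    """A patient-level multi-turn sample over one to five discharge summaries."""
    patient_id: str
    notes: List[str]
    turns: List[Turn] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The abstract states that each sample spans one to five notes.
        if not 1 <= len(self.notes) <= 5:
            raise ValueError("a sample must span one to five notes")

    def qa_pair_count(self) -> int:
        # Every turn contributes a content QA pair and an evidence QA pair.
        return 2 * len(self.turns)


# Consistency check on the reported totals: 8,036 content questions,
# each paired with one evidence-grounding question.
CONTENT_QUESTIONS = 8036
TOTAL_QA_PAIRS = 2 * CONTENT_QUESTIONS
```

Under this reading, the reported 16,072 QA pairs follow directly from doubling the 8,036 content questions, and a sample with `n` turns contributes `2 * n` QA pairs.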
