EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

Jiyoun Kim, Muhan Yeo, Eunhye Jang, Jeewon Yang, Hangyul Yoon, Su Ji Lee, Hee Jo Han, Hee-Jae Jung, Doyun Kwon, Jun young Lee, Jaehun Lee, Jung-Oh Lee, Sunjun Kweon, Jong Hak Moon, Daseul Kim, Minjae Cho, Edward Choi

Discharge summaries are crucial clinical documents that capture the context of a patient's entire hospital stay, and medical experts routinely review them for patient readmission, ongoing care, and diagnostic decision-making. When reviewing them, medical experts often must iteratively synthesize information across multiple summaries while verifying the evidence supporting each answer. Although large language models (LLMs) are increasingly explored for clinical question answering, existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn question answering with limited evaluation of evidence grounding. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over a patient's multiple discharge summaries. Built from de-identified MIMIC-IV discharge summaries, EHRNote-ChatQA contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. The benchmark is constructed through an expert-informed pipeline that combines a discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges: LLMs struggle more with evidence grounding than with content answering, errors compound across turns, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.
