We present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning (RL), enabling them to manage multi-turn interactive diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static data, our method acquires diagnostic strategies through dynamic exploration and outcome-based feedback, mapping evolving patient states to the optimal next examination and subsequent diagnosis. Our contributions include: (i) DiagGym, a diagnostics world model trained on electronic health records, serving as a virtual clinical environment to support closed-loop in-silico training and evaluation for interactive diagnosis; (ii) DiagAgent, trained via end-to-end multi-turn RL to learn dynamic diagnostic policies that optimize both interactive effectiveness and final accuracy; (iii) DiagBench, a multi-center diagnostic benchmark designed to evaluate multi-turn diagnostic interaction trajectories, comprising 2.2K physician-validated cases sourced from 4 distinct distributions alongside 3.3K physician-written rubrics for granular process-oriented evaluation; and (iv) extensive evaluations demonstrating DiagAgent's superior performance across both in-domain and out-of-domain (OOD) settings. DiagAgent significantly outperforms 11 SOTA LLMs and 2 prompt-engineered agents. In the end-to-end setting, it delivers an 11.20% increase in diagnostic accuracy and a 17.58% boost in examination recommendation F1 score, while consistently maintaining SOTA performance across all three external centers. Furthermore, in rubric-based evaluations, it surpasses the next-best model by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers long-term diagnostic management abilities unattainable through passive training.