Large language models (LLMs) have demonstrated excellent capabilities in biomedical question answering, but their application to real-world clinical consultations still faces core challenges. Single-turn consultation systems require patients to describe all symptoms upfront, which leads to vague diagnoses when chief complaints are unclear. Traditional multi-turn dialogue models, constrained by static supervised learning, lack flexibility and fail to intelligently extract key clinical information. To address these limitations, we propose \Ours{}, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultation as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy through multi-turn interactions with the patient agent within the RL framework, dynamically adjusting its information-gathering path based on comprehensive rewards from a Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we construct MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that \Ours{} outperforms existing models in both multi-turn reasoning capability and final diagnostic performance. The approach offers substantial practical value: it reduces the risk of misdiagnosis in time-pressured settings, frees clinicians to focus on complex cases, and points toward a strategy for optimizing medical resource allocation and alleviating workforce shortages. Code and data are available at https://github.com/JarvisUSTC/DoctorAgent-RL