Prevailing medical AI operates on an unrealistic "one-shot" model, diagnosing from a complete patient file. Real-world diagnosis, however, is an iterative inquiry in which clinicians sequentially ask questions and order tests to gather information strategically while managing cost and time. To address this gap, we first propose Med-Inquire, a new benchmark designed to evaluate an agent's ability to perform multi-turn diagnosis. Built upon a dataset of real-world clinical cases, Med-Inquire simulates the diagnostic process by hiding the complete patient file behind specialized Patient and Examination agents, which force the diagnosing agent to proactively ask questions and order tests to gather information piece by piece. To tackle the challenges posed by Med-Inquire, we then introduce EvoClinician, a self-evolving agent that learns efficient diagnostic strategies at test time. Its core is a "Diagnose-Grade-Evolve" loop: an Actor agent attempts a diagnosis; a Process Grader agent performs credit assignment by evaluating each action for both clinical yield and resource efficiency; finally, an Evolver agent uses this feedback to update the Actor's strategy by evolving its prompt and memory. Our experiments show that EvoClinician outperforms continual learning baselines and other self-evolving agents such as memory agents. The code is available at https://github.com/yf-he/EvoClinician
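The "Diagnose-Grade-Evolve" loop described above can be illustrated with a minimal sketch. This is not the EvoClinician implementation: every name here (`Action`, `ActorState`, `grade`, `evolve`, `run_case`, `toy_actor`), the scoring rule (yield minus a weighted cost), and the rule-based actor standing in for an LLM agent are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Action:
    kind: str     # "ask" (question to the patient) or "test" (ordered examination)
    target: str   # what is asked about or ordered
    cost: float   # hypothetical resource cost of the action


@dataclass
class ActorState:
    prompt: str
    memory: List[str] = field(default_factory=list)


Actor = Callable[[ActorState], Tuple[List[Action], str]]


def grade(actions: List[Action], findings: Dict[str, str],
          cost_weight: float = 0.1) -> List[Tuple[Action, float]]:
    """Toy process grader (assumed rule): per-action credit is
    clinical yield (1 if the action revealed a finding) minus a cost penalty."""
    graded = []
    for a in actions:
        clinical_yield = 1.0 if a.target in findings else 0.0
        graded.append((a, clinical_yield - cost_weight * a.cost))
    return graded


def evolve(state: ActorState, graded: List[Tuple[Action, float]]) -> ActorState:
    """Toy evolver: record a lesson for each non-positive-scoring action
    and rebuild the prompt so that it lists all lessons in memory."""
    lessons = [f"Avoid low-value {a.kind}: {a.target}" for a, s in graded if s <= 0]
    memory = state.memory + [m for m in lessons if m not in state.memory]
    prompt = state.prompt.split("\nLessons:")[0]  # strip any earlier lesson list
    if memory:
        prompt += "\nLessons:\n" + "\n".join(f"- {m}" for m in memory)
    return ActorState(prompt=prompt, memory=memory)


def run_case(state: ActorState, actor: Actor,
             findings: Dict[str, str]) -> Tuple[str, ActorState]:
    """One Diagnose-Grade-Evolve iteration on a single case."""
    actions, diagnosis = actor(state)       # Diagnose
    graded = grade(actions, findings)       # Grade (per-action credit assignment)
    return diagnosis, evolve(state, graded) # Evolve (prompt and memory update)


def toy_actor(state: ActorState) -> Tuple[List[Action], str]:
    """Rule-based stand-in for an LLM actor: follows a fixed plan but
    skips any action that a stored lesson warns against."""
    plan = [Action("ask", "symptom A", 0.0),
            Action("test", "cheap test", 1.0),
            Action("test", "expensive test", 20.0)]
    actions = [a for a in plan if not any(a.target in m for m in state.memory)]
    return actions, "placeholder diagnosis"
```

Calling `run_case` repeatedly and feeding each returned `ActorState` into the next call mimics test-time evolution: an action that yields nothing for its cost on one case is recorded as a lesson and skipped by the toy actor on later cases. A real grader would also need to judge the final diagnosis, which this sketch leaves out.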