Large language models (LLMs) are increasingly used for health-related decision support. Yet most evaluations treat diagnosis as a single-shot task in which complete information is provided upfront, often as a multiple-choice selection. This diverges from clinical practice, where diagnosis is interactive and open-ended and proceeds by sequentially refining hypotheses through targeted questioning. To address this gap, we build MeDxBench, a large-scale benchmark of 4,421 clinical cases across 20 specialties. We further propose MeDxAgent, a multi-agent consultation system for interactive diagnosis, and systematically study its prompt-, flow-, and agent-level design choices. MeDxAgent achieves a 10.3% accuracy gain over the baseline on MeDxBench, closing 52.3% of the gap to a full-information oracle. We find that specific design choices improve accuracy: collecting demographics first, passing a summarized dialogue to the diagnosis step, and feeding candidate diagnoses back to guide targeted questioning. These choices mirror how physicians reason, though their full effect emerges only in combination. Code and dataset will be released upon publication.
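The relationship between the two headline numbers can be made explicit with a small calculation. The sketch below is illustrative only and is not taken from the paper's code: it assumes the reported 10.3% gain is measured in absolute accuracy points and that "gap closed" means the gain divided by the oracle-minus-baseline accuracy difference. Neither definition is stated in the abstract, and the function names are hypothetical.

```python
def gap_closed(baseline: float, system: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle accuracy gap recovered by the system.

    Assumed definition (not confirmed by the abstract):
        (system - baseline) / (oracle - baseline)
    """
    gap = oracle - baseline
    if gap <= 0:
        raise ValueError("oracle accuracy must exceed baseline accuracy")
    return (system - baseline) / gap


def implied_gap(gain: float, fraction_closed: float) -> float:
    """Baseline-to-oracle gap implied by a gain and the fraction of the gap it closes."""
    if fraction_closed <= 0:
        raise ValueError("fraction_closed must be positive")
    return gain / fraction_closed


if __name__ == "__main__":
    # Reported figures: 10.3-point gain, 52.3% of the gap closed.
    # Under the assumptions above, the implied gap is about 19.7 points.
    print(round(implied_gap(10.3, 0.523), 1))
```

If the gain were instead a relative improvement, the implied gap would differ, so this reading should be checked against the full results.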