Interactive medical consultation requires an agent to proactively elicit missing clinical evidence under uncertainty. Yet existing evaluations remain largely static or outcome-centric and neglect the evidence-gathering process. In this work, we propose an interactive evaluation framework that explicitly models the consultation process using a simulated patient and a simulated reporter, both grounded in atomic pieces of evidence. Based on this representation, we introduce the Information Coverage Rate (ICR) to quantify how completely an agent uncovers the necessary evidence during interaction. To support systematic study, we build EviMed, an evidence-based benchmark spanning diverse conditions from common complaints to rare diseases, and evaluate 10 models with varying reasoning abilities. We find that strong diagnostic reasoning does not guarantee effective information collection, and that insufficient information collection is a primary bottleneck limiting performance in interactive settings. To address this, we propose REFINE, a strategy that leverages diagnostic verification to guide the agent in proactively resolving uncertainties. Extensive experiments demonstrate that REFINE consistently outperforms baselines across diverse datasets and facilitates effective model collaboration, enabling smaller agents to achieve superior performance under strong reasoning supervision. Our code is available at https://github.com/NanshineLoong/EID-Benchmark.
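As a rough illustration of what a coverage metric such as ICR might compute, the sketch below assumes that ICR is the fraction of necessary atomic evidence items that the agent elicited during the interaction. This definition is an assumption made for illustration; the abstract does not state the exact formula, and the function name and evidence labels are hypothetical.

```python
def information_coverage_rate(necessary, elicited):
    """Fraction of necessary evidence items that were elicited.

    Assumed definition for illustration only:
    ICR = |necessary ∩ elicited| / |necessary|.
    """
    necessary = set(necessary)
    if not necessary:
        raise ValueError("the set of necessary evidence must be non-empty")
    # Evidence the agent gathered that is not in the necessary set is ignored.
    return len(necessary & set(elicited)) / len(necessary)


# Hypothetical evidence labels for a single consultation.
necessary = {"fever", "cough_duration", "chest_xray", "smoking_history"}
elicited = {"fever", "chest_xray", "travel_history"}

icr = information_coverage_rate(necessary, elicited)
print(f"ICR = {icr:.2f}")  # 2 of the 4 necessary items were covered
```

Under this assumed definition, questions that surface irrelevant information neither raise nor lower the score; only the necessary items the agent actually uncovers count.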