Medical reasoning models remain constrained by parametric knowledge and are thus susceptible to forgetting and hallucinations. DeepResearch (DR) models ground outputs in verifiable evidence from tools and perform strongly in general domains, but their direct transfer to the medical field yields relatively limited gains. We attribute this to two gaps: task characteristics and tool-use scaling. Medical questions require evidence interpretation in a knowledge-intensive clinical context; while general DR models can retrieve information, they often lack clinical-context reasoning and thus "find it but fail to use it," leaving performance limited by medical ability. Moreover, in medical scenarios, blindly scaling tool calls can inject noisy context, derailing sensitive medical reasoning and prompting repetitive evidence-seeking along incorrect paths. We therefore propose DeepMed. For data, we deploy a multi-hop medical-search QA synthesis method that enables the model to apply the DR paradigm in medical contexts. For training, we introduce a difficulty-aware turn penalty to suppress excessive growth in tool calls. For inference, we add a monitor that helps validate hypotheses within a controlled number of steps and avoids context rot. Overall, on seven medical benchmarks, DeepMed improves its base model by 9.79\% on average and outperforms larger medical reasoning and DR models.