Large language models have achieved strong performance on medical reasoning benchmarks, yet their deployment in clinical settings demands rigorous verification to ensure factual accuracy. While reward models offer a scalable approach to reasoning trace verification, existing methods face two limitations: they produce only scalar reward values without explicit justification, and they rely on single-pass retrieval that precludes adaptive knowledge access as verification unfolds. We introduce $\method$, an agentic framework that addresses these limitations by training medical reasoning verifiers to iteratively query external medical corpora during evaluation. Our approach combines tool-augmented verification with an iterative reinforcement learning paradigm that requires only trace-level supervision, alongside an adaptive curriculum mechanism that dynamically adjusts the training data distribution. Across four medical reasoning benchmarks, $\method$ achieves substantial gains over existing methods, improving accuracy relative to the base generator by 23.5% on MedQA and 32.0% on MedXpertQA. Crucially, $\method$ requires an $\mathbf{8\times}$ smaller sampling budget than prior reward model baselines. These findings establish that grounding verification in dynamically retrieved evidence offers a principled path toward more reliable medical reasoning systems.