Large Language Models (LLMs) exhibit strong reasoning capacity in medical question answering, but their tendency to hallucinate and to rely on outdated knowledge poses critical risks in healthcare settings. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic conflict among candidate responses into actionable queries to retrieve external evidence, while optimizing historical reasoning traces to mitigate long-context degradation. MA-RAG extends the self-consistency principle by treating the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and it mirrors a boosting mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical consensus. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering a substantial gain of +6.8 points in average accuracy over the backbone model. Our code is available at https://github.com/NJU-RL/MA-RAG.
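The refinement loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`multi_round_loop`, `sample_fn`, `retrieve_fn`), the agreement threshold, and the query template are all hypothetical, and exact-match voting over answer labels stands in for the semantic conflict detection the paper describes. The toy sampler and retriever at the bottom exist only to make the example runnable.

```python
from collections import Counter
from typing import Callable, List, Tuple


def multi_round_loop(
    question: str,
    sample_fn: Callable[[str, List[str], List[str], int], str],
    retrieve_fn: Callable[[str], List[str]],
    n_samples: int = 5,
    max_rounds: int = 3,
    agree_threshold: float = 0.8,
) -> Tuple[str, int]:
    """Return (answer, rounds_used). All parameters are illustrative assumptions."""
    evidence: List[str] = []  # external evidence accumulated across rounds
    history: List[str] = []   # compact summary of earlier reasoning
    answer = ""
    for round_idx in range(max_rounds):
        # Sample several candidate answers given current evidence and history.
        candidates = [
            sample_fn(question, evidence, history, i) for i in range(n_samples)
        ]
        counts = Counter(candidates)
        answer, votes = counts.most_common(1)[0]
        # Enough agreement: stop early and return the consensus answer.
        if votes / n_samples >= agree_threshold:
            return answer, round_idx + 1
        # Otherwise, turn the disagreement into a retrieval query.
        options = " or ".join(sorted(counts))
        query = f"{question} Which is correct: {options}?"
        evidence.extend(retrieve_fn(query))
        # Replace the history with a short summary rather than appending
        # full traces, so the context does not grow every round.
        history = [f"round {round_idx + 1}: votes {dict(counts)}"]
    return answer, max_rounds


# Toy stand-ins (assumptions, not real model or retriever calls).
def toy_sample(question: str, evidence: List[str], history: List[str], i: int) -> str:
    # Without evidence the samples disagree; with evidence they converge on "B".
    return "B" if evidence else ["A", "B", "A", "B", "A"][i]


def toy_retrieve(query: str) -> List[str]:
    return ["passage supporting option B"]


result = multi_round_loop("Which drug is first-line?", toy_sample, toy_retrieve)
print(result)
```

In this toy run the first round splits 3 to 2, which triggers retrieval, and the second round reaches full agreement. In a real system the sampler would call the backbone LLM and the retriever would query a medical corpus.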