Retrieval-Augmented Generation (RAG) systems struggle with complex, multi-hop questions, and agentic frameworks such as Search-R1 (Jin et al., 2025), which operate iteratively, have been proposed to address them. However, such approaches can introduce inefficiencies, including repeated retrieval of previously processed information and difficulty contextualizing retrieved results within the current generation prompt. These issues can lead to unnecessary retrieval turns, suboptimal reasoning, inaccurate answers, and increased token consumption. In this paper, we investigate test-time modifications to the Search-R1 pipeline that mitigate these shortcomings. Specifically, we explore two components and their combination: a contextualization module that better integrates relevant information from retrieved documents into the model's reasoning, and a de-duplication module that replaces previously retrieved documents with the next most relevant ones. We evaluate our approaches on the HotpotQA (Yang et al., 2018) and Natural Questions (Kwiatkowski et al., 2019) datasets, reporting the exact match (EM) score, an LLM-as-a-Judge assessment of answer correctness, and the average number of retrieval turns. Our best-performing variant, which uses GPT-4.1-mini for contextualization, achieves a 5.6% increase in EM score and reduces the number of turns by 10.5% relative to the Search-R1 baseline, demonstrating improved answer accuracy and retrieval efficiency.
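The de-duplication module described above can be sketched as a thin wrapper around an existing retriever: documents returned in earlier turns are skipped, and the next most relevant candidates are returned instead. This is a minimal illustrative sketch under assumed interfaces; the `retriever` callable, `seen_ids` bookkeeping, and document dictionaries are hypothetical, not Search-R1's actual API.

```python
def deduplicated_retrieve(query, retriever, seen_ids, k=3):
    """Return up to k documents not yet shown in earlier retrieval turns.

    Hypothetical sketch: over-fetches from the underlying retriever so
    that enough candidates remain after filtering out seen documents.
    """
    candidates = retriever(query, top_k=k + len(seen_ids))
    # Keep only documents that have not appeared in a previous turn.
    fresh = [doc for doc in candidates if doc["id"] not in seen_ids]
    selected = fresh[:k]
    # Record the newly returned documents so later turns skip them too.
    seen_ids.update(doc["id"] for doc in selected)
    return selected


# Toy retriever for illustration: returns documents in a fixed
# relevance order regardless of the query.
CORPUS = [{"id": i, "text": f"doc {i}"} for i in range(10)]

def toy_retriever(query, top_k):
    return CORPUS[:top_k]


seen = set()
first = deduplicated_retrieve("who wrote X?", toy_retriever, seen, k=2)
second = deduplicated_retrieve("who wrote X?", toy_retriever, seen, k=2)
# first turn returns docs 0 and 1; the second turn skips them and
# returns docs 2 and 3 instead of repeating the same results
```

The over-fetch amount (`k + len(seen_ids)`) is one simple policy; a real system might instead page through the retriever's ranked list until `k` unseen documents are found.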