Deep research agents rely on iterative retrieval and reasoning to answer complex queries, but scaling test-time computation raises significant efficiency concerns. We study how to allocate reasoning budget in deep search pipelines, focusing on the role of listwise reranking. Using the BrowseComp-Plus benchmark, we analyze the tradeoffs between model scale, reasoning effort, reranking depth, and total token cost via a novel effective token cost (ETC) metric. Our results show that reranking consistently improves both retrieval quality and end-to-end accuracy, and that moderate reranking often yields larger gains than increasing search-time reasoning, achieving comparable accuracy at substantially lower cost. All our code is available at https://github.com/texttron/BrowseComp-Plus.git
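To make the retrieve-then-rerank setup concrete, the sketch below shows the shape of a listwise reranking step at a given depth. This is not the paper's implementation: the actual system uses an LLM reranker, whereas here a toy term-overlap scorer stands in, and all function names (`toy_score`, `listwise_rerank`) are hypothetical.

```python
def toy_score(query: str, doc: str) -> float:
    """Stand-in relevance score: fraction of query terms found in the doc.
    A real listwise reranker would instead prompt an LLM with the whole
    candidate list and parse an ordering from its output."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)


def listwise_rerank(query: str, docs: list[str], depth: int) -> list[str]:
    """Rerank only the top-`depth` retrieved candidates as a single list,
    leaving deeper results in their original retrieval order. `depth`
    plays the role of the reranking-depth knob studied in the paper."""
    head, tail = docs[:depth], docs[depth:]
    head = sorted(head, key=lambda d: toy_score(query, d), reverse=True)
    return head + tail


if __name__ == "__main__":
    retrieved = [
        "cats purr loudly",
        "token cost of reasoning",
        "reranking improves retrieval accuracy",
    ]
    print(listwise_rerank("reranking retrieval accuracy", retrieved, depth=3))
```

Because only the top-`depth` candidates are rescored, the reranker's token cost grows with `depth`, which is the tradeoff the ETC metric is designed to capture.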