Deep research agents rely on iterative retrieval and reasoning to answer complex queries, but scaling test-time computation raises significant efficiency concerns. We study how to allocate reasoning budget in deep search pipelines, focusing on the role of listwise reranking. Using the BrowseComp-Plus benchmark, we analyze tradeoffs between model scale, reasoning effort, reranking depth, and total token cost via a novel effective token cost (ETC) metric. Our results show that reranking consistently improves retrieval and end-to-end accuracy, and that moderate reranking often yields larger gains than increasing search-time reasoning, achieving comparable accuracy at substantially lower cost. All our code is available at https://github.com/sahel-sh/DeepHone