We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from the trajectories, while concatenating all trajectories exceeds the model's context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and by 10.3% on two deep research tasks, while adding minimal overhead, as the aggregation cost remains bounded by that of a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.
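To make the idea of treating parallel trajectories as an environment concrete, the following is a minimal sketch, not the AggAgent implementation. The names `Trajectory`, `TrajectoryEnv`, `list_candidates`, `search`, and `read_step` are hypothetical, as is the choice of plain substring matching for search. The sketch only illustrates how an aggregator could inspect candidate answers and look up evidence on demand instead of reading every trajectory in full.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One parallel rollout: its step texts plus its final answer."""
    steps: list[str]
    final_answer: str


class TrajectoryEnv:
    """Hypothetical environment exposing parallel rollouts through small tools."""

    def __init__(self, trajectories: list[Trajectory], max_chars: int = 200):
        self.trajectories = trajectories
        self.max_chars = max_chars  # assumed cap on text returned per tool call

    def list_candidates(self) -> list[tuple[int, str]]:
        """Return (trajectory index, truncated final answer) for every rollout."""
        return [(i, t.final_answer[: self.max_chars])
                for i, t in enumerate(self.trajectories)]

    def search(self, query: str) -> list[tuple[int, int, str]]:
        """Case-insensitive substring search over every step of every rollout."""
        q = query.lower()
        hits = []
        for i, t in enumerate(self.trajectories):
            for j, step in enumerate(t.steps):
                if q in step.lower():
                    hits.append((i, j, step[: self.max_chars]))
        return hits

    def read_step(self, traj: int, step: int) -> str:
        """Return one (truncated) step of one rollout."""
        return self.trajectories[traj].steps[step][: self.max_chars]


# Toy data standing in for real agentic rollouts.
rollouts = [
    Trajectory(["search: capital of Australia",
                "tool result: Canberra is the capital"], "Canberra"),
    Trajectory(["search: largest city in Australia",
                "tool result: Sydney is the largest city"], "Sydney"),
]
env = TrajectoryEnv(rollouts)
print(env.list_candidates())
print(env.search("capital"))
```

Because each tool call returns a bounded amount of text, the aggregator's context in this sketch grows with the number of calls it chooses to make rather than with the total length of all trajectories, which is the property the abstract attributes to agentic aggregation.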