Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. Process reward models (PRMs) can guide agents by ranking candidate steps at test time, but existing PRMs, which are designed for short reasoning with binary judgments, can neither capture the richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context of long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple dimensions of step quality (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving the information essential for step evaluation. Extensive evaluations on the FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks across multiple models show that best-of-n sampling with PRInTS enhances the information-seeking abilities of open-source models as well as specialized agents, matching or surpassing frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
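The test-time procedure described above, in which a PRM ranks candidate steps against a compressed trajectory summary, can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Step`, `best_of_n_step`, `run_trajectory`, and the `propose`, `score`, `execute`, and `summarize` callables are hypothetical names introduced here, and the sketch assumes the PRM's dense score can be reduced to a single float per candidate.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One candidate agent step (hypothetical structure)."""
    thought: str
    tool_call: str


def best_of_n_step(
    propose: Callable[[str, str], List[Step]],
    score: Callable[[str, str, Step], float],
    question: str,
    summary: str,
) -> Tuple[Step, float]:
    """Sample candidate steps and keep the one the reward model scores highest.

    `propose` stands in for the backbone agent and `score` for the PRM;
    both are assumptions of this sketch, not a real API.
    """
    candidates = propose(question, summary)
    if not candidates:
        raise ValueError("propose() returned no candidate steps")
    scored = [(score(question, summary, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def run_trajectory(
    question: str,
    propose: Callable[[str, str], List[Step]],
    score: Callable[[str, str, Step], float],
    execute: Callable[[Step], str],
    summarize: Callable[[str, Step, str], str],
    max_steps: int,
) -> Tuple[str, List[Step]]:
    """Run a fixed number of best-of-n steps, carrying a summary as context.

    The summary replaces the full history, so the context passed to the
    scorer stays bounded as the trajectory grows.
    """
    summary = ""
    history: List[Step] = []
    for _ in range(max_steps):
        step, _ = best_of_n_step(propose, score, question, summary)
        tool_output = execute(step)
        summary = summarize(summary, step, tool_output)
        history.append(step)
    return summary, history
```

In this sketch the scorer only ever sees the question, the running summary, and the candidate step, which mirrors the idea of evaluating steps against a compressed context rather than the full trajectory. A real system would also need a stopping condition for when the agent produces a final answer.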