Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, they carry no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, sequences of actions that will lead to a correct response to the user's prompt) from unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Overall, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
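To make the sequential-testing idea above concrete, the following is a minimal sketch of a generic likelihood-ratio e-process monitor. It is not the authors' implementation. The abstract does not specify how e-valuator constructs its e-process, so everything here is an illustrative assumption: the function names (`gaussian_log_pdf`, `make_log_ratio`, `monitor`), the Gaussian score distributions, and the choice to treat per-step scores as independent. The null hypothesis is taken to be "the trajectory is successful", so a false alarm means flagging a successful trajectory. If each factor is a true likelihood ratio under that null, the running product is a nonnegative martingale starting at 1, and Ville's inequality bounds the probability that it ever reaches 1/alpha by alpha, at any step and for any trajectory length.

```python
import math
from typing import Callable, Iterable, Optional


def gaussian_log_pdf(x: float, mean: float, std: float) -> float:
    """Log-density of a normal distribution (illustrative score model)."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def make_log_ratio(mu_success: float, mu_failure: float, std: float) -> Callable[[float], float]:
    """Hypothetical helper: log of p_failure(score) / p_success(score).

    Assumes verifier scores are Gaussian with a shared std under both
    hypotheses. In practice this ratio would have to be estimated.
    """
    def log_ratio(score: float) -> float:
        return gaussian_log_pdf(score, mu_failure, std) - gaussian_log_pdf(score, mu_success, std)
    return log_ratio


def monitor(scores: Iterable[float], log_ratio: Callable[[float], float], alpha: float) -> Optional[int]:
    """Return the 1-based step at which the trajectory is flagged, else None.

    The e-process is accumulated in log space to avoid underflow on long
    trajectories. Flagging when it reaches log(1/alpha) keeps the false
    alarm rate at or below alpha under the stated assumptions.
    """
    log_threshold = math.log(1.0 / alpha)
    log_e_value = 0.0  # e-process starts at 1
    for step, score in enumerate(scores, start=1):
        log_e_value += log_ratio(score)
        if log_e_value >= log_threshold:
            return step  # stop early: evidence against "successful"
    return None


# Example with assumed score distributions (successful ~ N(0.8, 0.2), failing ~ N(0.3, 0.2)).
log_ratio = make_log_ratio(mu_success=0.8, mu_failure=0.3, std=0.2)
flag_step = monitor([0.5, 0.4, 0.3], log_ratio, alpha=0.05)
no_flag = monitor([0.8, 0.9, 0.7], log_ratio, alpha=0.05)
```

With these assumed parameters, the declining score sequence is flagged at its third step, while the consistently high-scoring sequence is never flagged. Stopping at the flagged step is what would allow a problematic trajectory to be terminated early and tokens to be saved.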