Verifiers--functions assigning rewards to agent behavior--have been key to AI progress in math, code, and games. However, extending gains to domains without clear-cut success criteria remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal LLMs (MLLMs) offer a promising solution, given their world knowledge, human-preference alignment, and reasoning capabilities. We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents. We identify a critical limitation: a strong tendency for MLLMs to over-validate agent behavior--a phenomenon we term agreement bias. This bias is pervasive, resilient to test-time scaling, and can harm applications relying on MLLM judgments/rewards (e.g., self-improvement, steering, online supervision). We discuss several considerations for evaluating and designing MLLM verifiers, and introduce SGV, a lightweight method that better leverages their capabilities by modulating (un)conditional generation. First, an MLLM is elicited to generate broad priors about desired behavior, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. Our methods yield more human-aligned verifiers, improving failure detection by 25pp and accuracy by 14pp. In self-improvement and online supervision, they boost task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena--surpassing the previous state of the art by 20pp. As a byproduct, we release an update of VisualWebArena featuring strong agent baselines, more human-aligned oracles, container parallelism with high fidelity and proper resets, >10x speedups, and VWA-Lite, a 1/3 subset with comparable evaluation fidelity.
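The two-stage SGV procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm` stands in for any chat-style MLLM call (prompt in, text out), and the prompt wording is an assumption.

```python
# Minimal sketch of the two-stage SGV flow: (1) unconditional generation of
# priors about desired behavior, (2) evaluation of the candidate trajectory
# conditioned on those self-generated priors. The prompts below are
# illustrative, not the paper's exact prompts.

def sgv_verify(llm, task: str, trajectory: str) -> bool:
    # Stage 1: elicit broad priors about what success on this task looks
    # like, WITHOUT showing the trajectory under evaluation.
    priors = llm(
        f"Task: {task}\n"
        "Describe, in general terms, what a successful attempt at this "
        "task must accomplish and list common failure modes. Do not "
        "assume any particular attempt succeeded."
    )
    # Stage 2: conditioned on the self-generated priors, reason over and
    # evaluate the candidate trajectory.
    verdict = llm(
        f"Success criteria and failure modes:\n{priors}\n\n"
        f"Candidate trajectory:\n{trajectory}\n\n"
        "Judged strictly against the criteria above, did this trajectory "
        "succeed? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```

Separating prior generation from evaluation is what counteracts agreement bias: the verifier commits to success criteria before seeing the agent's behavior, rather than rationalizing whatever the trajectory shows.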