Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent's CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of task progress, and find that content-based manipulations are consistently more effective. We evaluate prompting-based techniques and scaling judge-time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.
翻译:大型语言模型(LLM)越来越多地被用作评估智能体性能的裁判,特别是在不可验证的场景中——这类场景的评估依赖于包含思维链(CoT)推理的智能体轨迹。该范式隐含地假设智能体的思维链忠实地反映了其内部推理过程及底层环境状态。我们证明这一假设是脆弱的:LLM裁判极易受到智能体推理轨迹篡改的影响。通过系统性地重写智能体思维链并保持其行动与观察不变,我们证明仅操纵推理过程就可使当前最先进的视觉语言模型(VLM)裁判在涵盖多样化网页任务的800条轨迹上的误判率提升高达90%。我们研究了从仅改变推理呈现方式的风格型操纵到伪造任务进度信号的内容型操纵等多种策略,发现内容型操纵始终更为有效。我们评估了基于提示的技术与增加裁判计算资源的方法,这些方法虽能降低但对操纵的敏感性并未完全消除。我们的研究揭示了基于LLM评估机制的根本性脆弱点,并强调了需要建立能够依据可观测证据验证推理主张的裁判机制。