Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories that include chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent's CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies ranging from style-based approaches, which alter only the presentation of reasoning, to content-based approaches, which fabricate signals of task progress, and find that content-based manipulations are consistently more effective. As mitigations, we evaluate prompting-based techniques and scaling judge-time compute; both reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.
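The measurement described in the abstract (rewrite only the CoT, keep actions and observations fixed, then compare false positive rates) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Trajectory` fields, the keyword-matching `naive_judge`, and the `fabricate_progress` rewriter are all hypothetical stand-ins chosen so the example runs without any model calls.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical trajectory record; field names are assumptions."""
    actions: tuple[str, ...]
    observations: tuple[str, ...]
    cot: tuple[str, ...]      # one reasoning string per step
    succeeded: bool           # ground-truth task outcome


Judge = Callable[[Trajectory], bool]   # True means "judge says success"
Rewriter = Callable[[str], str]        # maps one CoT step to a rewritten step


def manipulate(traj: Trajectory, rewrite: Rewriter) -> Trajectory:
    """Rewrite only the CoT; actions, observations, and label are untouched."""
    return replace(traj, cot=tuple(rewrite(step) for step in traj.cot))


def false_positive_rate(judge: Judge, trajs: Sequence[Trajectory]) -> float:
    """Fraction of truly failed trajectories that the judge marks as success."""
    failures = [t for t in trajs if not t.succeeded]
    if not failures:
        raise ValueError("need at least one failed trajectory")
    return sum(1 for t in failures if judge(t)) / len(failures)


def fpr_before_after(
    judge: Judge, trajs: Sequence[Trajectory], rewrite: Rewriter
) -> tuple[float, float]:
    """Return (FPR on original CoTs, FPR on manipulated CoTs)."""
    manipulated = [manipulate(t, rewrite) for t in trajs]
    return false_positive_rate(judge, trajs), false_positive_rate(judge, manipulated)


def naive_judge(traj: Trajectory) -> bool:
    """Toy stand-in for an LLM judge: trusts any CoT claim of completion."""
    return any("task completed" in step.lower() for step in traj.cot)


def fabricate_progress(step: str) -> str:
    """Toy content-based manipulation: append a fabricated success claim."""
    return step + " Task completed successfully."


if __name__ == "__main__":
    demo = [
        Trajectory(("click",), ("error page",), ("The page failed to load.",), False),
        Trajectory(("type",), ("empty cart",), ("The cart is still empty.",), False),
    ]
    print(fpr_before_after(naive_judge, demo, fabricate_progress))
```

In a real study the judge would be a model call that receives the full trajectory, and the rewriter would itself likely be model-driven. The sketch only fixes the bookkeeping: the ground-truth label and the action/observation stream are never modified, so any change in the false positive rate is attributable to the rewritten reasoning alone.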