Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. To broaden the exploits the loop discovers, we further add verifier access and let patches transfer across tasks. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: a loop run with Gemini 3 Flash drives the attack success rates of the stronger Gemini 3.1 Pro and Claude Opus 4.7 from 76% and 61%, respectively, to 0% on KernelBench, and that of Gemini 3.1 Pro from 39% to 17% across 77 Terminal Bench tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, together with our patched verifiers, the exploits the loop discovered, and our implementation, as a basis for future work.
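To make the control flow of the loop concrete, the following is a minimal sketch and not the released implementation. It assumes a verifier can be modeled as a list of boolean checks over a submission, and it replaces the three LLM agents with toy functions. Every name here (`Verifier`, `hacker_fixer_loop`, `toy_hacker`, `toy_fixer`, `toy_solver`, `REFERENCE`, `EXPLOIT_POOL`, and the submission fields) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Submission = dict  # toy stand-in for whatever state an agent leaves behind
Check = Callable[[Submission], bool]


@dataclass
class Verifier:
    """Toy verifier: a submission passes only if every check accepts it."""
    checks: List[Check] = field(default_factory=list)

    def passes(self, sub: Submission) -> bool:
        return all(check(sub) for check in self.checks)


# Hypothetical legitimate solution and a fixed pool of candidate exploits.
REFERENCE = {"exit_code": 0, "output": [1, 2, 3], "read_answer_key": False}
EXPLOIT_POOL = [
    {"exit_code": 0, "output": None, "read_answer_key": False},      # exits cleanly, no work done
    {"exit_code": 0, "output": [1, 2, 3], "read_answer_key": True},  # copies a leaked answer
]


def toy_hacker(verifier: Verifier) -> Optional[Submission]:
    """Return the first pooled exploit the verifier still accepts, if any."""
    return next((e for e in EXPLOIT_POOL if verifier.passes(e)), None)


def toy_fixer(verifier: Verifier, exploit: Submission) -> Verifier:
    """Add a check on the first field where the exploit differs from the reference."""
    for key in REFERENCE:
        if exploit[key] != REFERENCE[key]:
            new_check = lambda s, k=key: s[k] == REFERENCE[k]
            return Verifier(checks=verifier.checks + [new_check])
    return verifier  # no distinguishing field found; verifier is unchanged


def toy_solver(verifier: Verifier) -> bool:
    """Confirm that the legitimate solution is still admitted."""
    return verifier.passes(REFERENCE)


def hacker_fixer_loop(verifier, hacker, fixer, solver, max_rounds=10):
    """Alternate hacker, fixer, and solver until the hacker finds nothing."""
    found = []
    for _ in range(max_rounds):
        exploit = hacker(verifier)
        if exploit is None:
            break  # no exploit found against the current verifier
        patched = fixer(verifier, exploit)
        # Accept a patch only if it rejects the exploit and the solver still passes.
        if not patched.passes(exploit) and solver(patched):
            verifier = patched
            found.append(exploit)
        # Otherwise the patch is discarded and the next round retries.
    return verifier, found


# Start from a brittle verifier that only checks the exit code.
initial = Verifier(checks=[lambda s: s["exit_code"] == 0])
hardened, found = hacker_fixer_loop(initial, toy_hacker, toy_fixer, toy_solver)
print(f"accepted patches: {len(found)}, checks in verifier: {len(hardened.checks)}")
```

In this toy run the first patch (checking the output) leaves the second exploit (a correct output obtained from a leaked answer) still passing, so a second round is needed before the hacker comes up empty. The solver gate is the design choice worth noting: a patch that rejects the exploit but also rejects the legitimate solution is never accepted.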
