EVMbench, released by OpenAI, Paradigm, and OtterSec, is the first large-scale benchmark for AI agents on smart contract security. Its results -- agents detect up to 45.6% of vulnerabilities and exploit 72.2% of a curated subset -- have fueled expectations that fully automated AI auditing is within reach. We identify two limitations: its narrow evaluation scope (14 agent configurations, with most models tested only on their vendor's scaffold) and its reliance on audit-contest data published before every model's release, which the models may have seen during training. To address these, we expand to 26 configurations across four model families and three scaffolds, and introduce a contamination-free dataset of 22 real-world security incidents postdating every model's release date. Our evaluation yields three findings: (1) agents' detection results are not stable, with rankings shifting across configurations, tasks, and datasets; (2) on real-world incidents, no agent succeeds at end-to-end exploitation across all 110 agent-incident pairs despite detecting up to 65% of vulnerabilities, contradicting EVMbench's conclusion that discovery is the primary bottleneck; and (3) scaffolding materially affects results, with an open-source scaffold outperforming vendor alternatives by up to 5 percentage points, a factor EVMbench does not control for. These findings challenge the narrative that fully automated AI auditing is imminent. Agents reliably catch well-known patterns and respond strongly to human-provided context, but cannot replace human judgment. For developers, agent scans serve as a pre-deployment check. For audit firms, agents are most effective within a human-in-the-loop workflow where AI handles breadth and human auditors contribute protocol-specific knowledge and adversarial reasoning. Code and data: https://github.com/blocksecteam/ReEVMBench/.