Smart contracts on public blockchains now manage large amounts of value, and vulnerabilities in these systems can lead to substantial losses. As AI agents become more capable at reading, writing, and running code, it is natural to ask how well they can already navigate this landscape, both in ways that improve security and in ways that might increase risk. We introduce EVMbench, an evaluation that measures the ability of agents to detect, patch, and exploit smart contract vulnerabilities. EVMbench draws on 117 curated vulnerabilities from 40 repositories and, in the most realistic setting, uses programmatic grading based on tests and blockchain state under a local Ethereum execution environment. We evaluate a range of frontier agents and find that they are capable of discovering and exploiting vulnerabilities end-to-end against live blockchain instances. We release code, tasks, and tooling to support continued measurement of these capabilities and future work on security.