Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and other researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyber risk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes a description and starter files, and is initialized in an environment where an agent can execute bash commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks, which break down a task into intermediate steps for more gradated evaluation; we add subtasks for 17 of the 40 tasks. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 7 models: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. Without guidance, we find that agents are able to solve only the easiest complete tasks, those that took human teams up to 11 minutes to solve, with Claude 3.5 Sonnet and GPT-4o having the highest success rates. Finally, subtasks provide more signal for measuring performance compared to unguided runs, with models achieving a 3.2% higher success rate on complete tasks with subtask guidance than without. All code and data are publicly available at https://cybench.github.io.