Language model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyber-risk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description and starter files, and is initialized in an environment where an agent can execute commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks for each task, which break down a task into intermediary steps for a more detailed evaluation. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. For the top-performing models (GPT-4o and Claude 3.5 Sonnet), we further investigate performance across 4 agent scaffolds (structured bash, action-only, pseudoterminal, and web search). Without subtask guidance, agents leveraging Claude 3.5 Sonnet, GPT-4o, OpenAI o1-preview, and Claude 3 Opus successfully solved complete tasks that took human teams up to 11 minutes to solve. In comparison, the most difficult task took human teams 24 hours and 54 minutes to solve. All code and data are publicly available at https://cybench.github.io.