We introduce a new benchmark for assessing AI models' capabilities and risks in automated software exploitation, focusing on their ability to detect and exploit vulnerabilities in real-world software systems. Using DARPA's AI Cyber Challenge (AIxCC) framework and the Nginx challenge project, a deliberately modified version of the widely used Nginx web server, we evaluate several leading language models: OpenAI's o1-preview and o1-mini, Anthropic's Claude-3.5-sonnet-20241022 and Claude-3.5-sonnet-20240620, Google DeepMind's Gemini-1.5-pro, and OpenAI's earlier GPT-4o. Our findings reveal that these models vary significantly in success rate and efficiency, with o1-preview achieving the highest success rate at 64.71%, while o1-mini and Claude-3.5-sonnet-20241022 offer cost-effective but less successful alternatives. This benchmark establishes a foundation for systematically evaluating the AI cyber risk posed by automated exploitation tools.