We introduce a new benchmark for assessing AI models' capabilities and risks in automated software exploitation, focusing on their ability to detect and exploit vulnerabilities in real-world software systems. Using DARPA's AI Cyber Challenge (AIxCC) framework and the Nginx challenge project, a deliberately modified version of the widely used Nginx web server, we evaluate several leading language models: OpenAI's o1-preview and o1-mini, Anthropic's Claude-3.5-sonnet-20241022 and Claude-3.5-sonnet-20240620, Google DeepMind's Gemini-1.5-pro, and OpenAI's earlier GPT-4o. Our findings reveal that these models vary significantly in success rate and efficiency, with o1-preview achieving the highest success rate at 64.71%, while o1-mini and Claude-3.5-sonnet-20241022 offer cost-effective but less successful alternatives. This benchmark establishes a foundation for systematically evaluating the AI cyber risk posed by automated exploitation tools.