We evaluate OpenAI's o1-preview and o1-mini models, benchmarking their performance against the earlier GPT-4o model. Our evaluation focuses on their ability to detect vulnerabilities in real-world software by generating structured inputs that trigger known sanitizers. Using DARPA's AI Cyber Challenge (AIxCC) framework and the Nginx challenge project, a deliberately modified version of the widely used Nginx web server, we create a well-defined yet complex environment for testing LLMs on automated vulnerability detection (AVD) tasks. Our results show that the o1-preview model significantly outperforms GPT-4o in both success rate and efficiency, especially in more complex scenarios.