AI has the potential to transform cybersecurity by enabling systems that autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope and fail to capture the end-to-end lifecycle of real-world vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale, realistic end-to-end cybersecurity benchmark that evaluates AI agents across the full lifecycle of vulnerability discovery, proof-of-concept (PoC) generation, and patch generation. To make the benchmark comprehensive and scalable, we build an automated, agent-enhanced pipeline that transforms open-source vulnerability data into realistic evaluation environments. The benchmark currently consists of 920 real-world vulnerabilities across 139 open-source projects.