We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex~5.3, Claude Opus~4.6, Sonnet~4.6, Gemini~3.1~Pro, and Gemini~3~Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1)~every frontier model has a false-positive rate of 10-50% in white-box detection, systematically over-predicting vulnerabilities; (2)~in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3)~structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4)~a domain-specialized defense model achieves the highest precision (0.904) and the lowest false-positive rate (9.7%) among all models while running on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training-data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.