We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex~5.3, Claude Opus~4.7, Sonnet~4.6, Gemini~3.1~Pro, and Gemini~3~Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1)~every frontier model has a false positive rate of 10-50% in white-box detection, systematically over-predicting vulnerabilities; (2)~in black-box testing, frontier models achieve only 4-8% ground-truth coverage, rising to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3)~structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4)~a domain-specialized defense model achieves the highest precision (0.904) and the lowest false positive rate (9.7%) among all models while running on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.
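For readers less familiar with the metrics quoted above, the following is a minimal sketch of how precision, false positive rate, and ground-truth coverage are conventionally computed. It assumes the standard textbook definitions; the names `Confusion`, `precision`, `false_positive_rate`, and `coverage` are hypothetical helpers, not part of the benchmark's code, and the counts in the usage lines are illustrative rather than results from this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for a binary 'vulnerable / not vulnerable' classifier (hypothetical helper)."""
    tp: int  # vulnerable functions flagged as vulnerable
    fp: int  # safe functions flagged as vulnerable
    tn: int  # safe functions flagged as safe
    fn: int  # vulnerable functions flagged as safe


def precision(c: Confusion) -> float:
    """Share of flagged functions that are truly vulnerable: TP / (TP + FP)."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def false_positive_rate(c: Confusion) -> float:
    """Share of safe functions that were wrongly flagged: FP / (FP + TN)."""
    safe = c.fp + c.tn
    return c.fp / safe if safe else 0.0


def coverage(found: set[str], ground_truth: set[str]) -> float:
    """Share of ground-truth vulnerabilities that a black-box run reported."""
    if not ground_truth:
        return 0.0
    return len(found & ground_truth) / len(ground_truth)


# Illustrative counts only (assumed for the example, not taken from the paper).
example = Confusion(tp=90, fp=10, tn=90, fn=10)
print(precision(example))            # 90 / (90 + 10)
print(false_positive_rate(example))  # 10 / (10 + 90)
print(coverage({"a", "b", "x"}, {"a", "b", "c", "d"}))  # 2 of 4 ground-truth items
```

Under these definitions a model can have a high false positive rate while still detecting most vulnerabilities, which is why over-prediction shows up in the false positive rate rather than in recall.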