Cybersecurity spans multiple interconnected domains, complicating the development of meaningful, labor-relevant benchmarks. Existing benchmarks assess isolated skills rather than integrated performance. We find that pre-trained cybersecurity knowledge in LLMs does not imply attack and defense ability, revealing a gap between knowledge and capability. To address this limitation, we present the Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework for evaluating LLMs and agents across offensive and defensive cybersecurity domains, taking a step toward meaningfully measuring their labor relevance. CAIBench integrates five evaluation categories, covering over 10,000 instances: Jeopardy-style CTFs, Attack and Defense CTFs, Cyber Range exercises, knowledge benchmarks, and privacy assessments. Key novel contributions include systematic simultaneous offensive-defensive evaluation, robotics-focused cybersecurity challenges (RCTF2), and privacy-preserving performance assessment (CyberPII-Bench). Evaluation of state-of-the-art AI models reveals saturation on security knowledge metrics ($\sim$70\% success) but substantial degradation in multi-step adversarial Attack and Defense (A\&D) scenarios (20--40\% success), and worse still on robotic targets (22\% success). The combination of framework scaffolding and LLM choice significantly impacts performance: a well-matched pairing yields up to a 2.6$\times$ performance difference in Attack and Defense CTFs. These results demonstrate a pronounced gap between conceptual knowledge and adaptive capability, underscoring the need for a meta-benchmark.