Cybersecurity spans multiple interconnected domains, complicating the development of meaningful, labor-relevant benchmarks. Existing benchmarks assess isolated skills rather than integrated performance. We find that pre-trained cybersecurity knowledge in LLMs does not imply attack and defense ability, revealing a gap between knowledge and capability. To address this limitation, we present the Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework for evaluating LLMs and agents across offensive and defensive cybersecurity domains, taking a step toward meaningfully measuring their labor relevance. CAIBench integrates five evaluation categories, covering over 10,000 instances: Jeopardy-style CTFs, Attack and Defense CTFs, Cyber Range exercises, knowledge benchmarks, and privacy assessments. Key novel contributions include systematic simultaneous offensive-defensive evaluation, robotics-focused cybersecurity challenges (RCTF2), and privacy-preserving performance assessment (CyberPII-Bench). Evaluation of state-of-the-art AI models reveals saturation on security knowledge metrics ($\sim$70\% success) but substantial degradation in multi-step adversarial Attack and Defense (A\&D) scenarios (20--40\% success), and worse still on robotic targets (22\% success). The combination of framework scaffolding and LLM choice significantly impacts performance: a well-matched pairing yields up to a 2.6$\times$ performance difference in Attack and Defense CTFs. These results demonstrate a pronounced gap between conceptual knowledge and adaptive capability, underscoring the need for a meta-benchmark.