Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks. However, existing pointwise benchmarks shed little light on the robustness and generalisation abilities of agents across alternative versions of a challenge's source code. We introduce CTF challenge families, whereby a single CTF serves as the basis for generating a family of semantically equivalent challenges via semantics-preserving program transformations. This enables controlled evaluation of agent robustness to source-code transformations while keeping the underlying exploit strategy fixed. We present Evolve-CTF, a new tool that generates CTF families from Python challenges using a range of transformations. Using Evolve-CTF to derive families from Cybench and Intercode challenges, we evaluate 13 agentic LLM configurations with tool access. We find that models are remarkably robust to intrusive renaming and code-insertion-based transformations, but that composed transformations and deeper obfuscation degrade performance by requiring more sophisticated use of tools. We also find that enabling explicit reasoning has little effect on solution success rates across challenge families. Our work contributes a valuable technique and tool for future LLM evaluations, and a large dataset characterising the capabilities of current state-of-the-art models in this domain.
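As an illustration of the kind of semantics-preserving transformation such a tool might apply, the sketch below is a minimal example of our own devising (not Evolve-CTF's actual implementation): an identifier-renaming pass over a toy Python challenge, built with the standard ast module. All names in it are illustrative assumptions.

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Rename locally bound variables to opaque placeholders.

    Renaming is semantics-preserving: control flow, data flow, and
    observable behaviour are unchanged; only identifier text differs.
    """

    def __init__(self):
        self.mapping = {}  # original name -> fresh placeholder

    def _fresh(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            # A binding occurrence: assign (or reuse) a placeholder.
            node.id = self._fresh(node.id)
        elif node.id in self.mapping:
            # A use of a name we already renamed; builtins such as
            # `print` never enter the mapping and are left intact.
            node.id = self.mapping[node.id]
        return node

source = '''
secret = "flag{example}"
key = len(secret) * 2
print(secret, key)
'''

# ast.unparse requires Python 3.9+.
variant = ast.unparse(RenameIdentifiers().visit(ast.parse(source)))
print(variant)  # secret -> v0, key -> v1; runtime behaviour is identical
```

Because the variant behaves identically at runtime, any change in an agent's success rate on it can be attributed to the surface-level rewrite rather than to a different underlying exploit.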