Generative AI agents, software systems powered by Large Language Models (LLMs), are emerging as a promising approach to automating cybersecurity tasks. Among these, penetration testing is a particularly challenging field due to the complexity of the task and the diversity of strategies required to simulate cyber-attacks. Despite growing interest and initial studies on automating penetration testing with generative agents, there is still no comprehensive, standardised framework for their evaluation and development. This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing. We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack. Tasks span increasing difficulty levels and cover both in-vitro and real-world scenarios. We assess agent performance with generic and specific milestones that allow us to compare results in a standardised manner and to understand the limits of the agent under test. We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous agent and a semi-autonomous agent that supports human interaction. We compare their performance and limitations. For example, the fully autonomous agent performs unsatisfactorily, achieving a 21% Success Rate (SR) across the benchmark, solving 27% of the simple tasks and only one real-world task. In contrast, the assisted agent demonstrates substantial improvements, reaching a 64% SR. AutoPenBench also allows us to observe how different LLMs, such as GPT-4o or OpenAI o1, affect the agents' ability to complete the tasks. We believe that our benchmark fills this gap, offering a standard and flexible framework to compare penetration testing agents on common ground. We make AutoPenBench available at https://github.com/lucagioacchini/auto-pen-bench and hope to extend it together with the research community.