AutoPT: How Far Are We from the End2End Automated Web Penetration Testing?

Penetration testing is essential to Web security: it detects vulnerabilities so that they can be fixed in advance, preventing data leakage and other serious consequences. Large language models (LLMs) have shown powerful inference capabilities and driven significant progress in various fields, and LLM-based agents have the potential to revolutionize the cybersecurity penetration testing industry. In this work, we establish a comprehensive end-to-end penetration testing benchmark built on a real-world penetration testing environment to explore the capabilities of LLM-based agents in this domain. Our results reveal that the agents are familiar with the overall framework of penetration testing tasks, but they are still limited in generating accurate commands and executing complete processes. Accordingly, we summarize the current challenges, including the difficulty of maintaining the entire message history and the tendency of the agent to become stuck. Based on these insights, we propose a Penetration testing State Machine (PSM) that uses the Finite State Machine (FSM) methodology to address these limitations. We then introduce AutoPT, an automated penetration testing agent based on the PSM principle and driven by LLMs, which combines the inherent inference ability of LLMs with the constraint framework of state machines. Our evaluation shows that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model, improving the task completion rate on the benchmark target from 22% to 41%. Compared with the baseline framework and manual work, AutoPT also reduces time and economic costs. Hence, AutoPT facilitates the development of automated penetration testing, with implications for both academia and industry.
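
To make the state-machine idea concrete, the following is a minimal illustrative sketch and not the AutoPT implementation. The state names (SCAN, SELECT, EXPLOIT, CHECK), the retry limit, and the stub handlers that stand in for LLM calls are all assumptions introduced here; the abstract does not specify them. The sketch shows how an FSM could target the two challenges named above: each state passes on only a short summary rather than the entire message history, and a bounded retry count keeps the agent from becoming stuck in one state.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, Dict, List, Tuple


class State(Enum):
    # Hypothetical stage names; the source does not list the PSM's states.
    SCAN = auto()
    SELECT = auto()
    EXPLOIT = auto()
    CHECK = auto()
    DONE = auto()
    FAILED = auto()


# A handler receives the shared context and returns (success, short summary).
Handler = Callable[[Dict[str, str]], Tuple[bool, str]]


@dataclass
class StateMachine:
    handlers: Dict[State, Handler]
    transitions: Dict[State, State]  # next state after a successful step
    max_retries: int = 2  # assumed bound so a failing state cannot loop forever
    trace: List[Tuple[State, bool]] = field(default_factory=list)

    def run(self, start: State) -> State:
        context: Dict[str, str] = {}
        state = start
        retries = 0
        while state not in (State.DONE, State.FAILED):
            ok, summary = self.handlers[state](context)
            self.trace.append((state, ok))
            if ok:
                # Store one short summary per state, not the full history.
                context[state.name] = summary
                state = self.transitions[state]
                retries = 0
            else:
                retries += 1
                if retries > self.max_retries:
                    state = State.FAILED
        return state


def make_stub(summary: str, fail_first: int = 0) -> Handler:
    """Stand-in for an LLM-driven step: fails `fail_first` times, then succeeds."""
    calls = {"n": 0}

    def handler(context: Dict[str, str]) -> Tuple[bool, str]:
        calls["n"] += 1
        return calls["n"] > fail_first, summary

    return handler


def build_demo(exploit_failures: int) -> StateMachine:
    """Build a machine whose EXPLOIT step fails a given number of times."""
    return StateMachine(
        handlers={
            State.SCAN: make_stub("service information collected"),
            State.SELECT: make_stub("candidate vulnerability chosen"),
            State.EXPLOIT: make_stub("command executed", exploit_failures),
            State.CHECK: make_stub("goal verified"),
        },
        transitions={
            State.SCAN: State.SELECT,
            State.SELECT: State.EXPLOIT,
            State.EXPLOIT: State.CHECK,
            State.CHECK: State.DONE,
        },
    )


if __name__ == "__main__":
    machine = build_demo(exploit_failures=1)
    print(machine.run(State.SCAN).name)
    print([(s.name, ok) for s, ok in machine.trace])
```

In this sketch, a single failed EXPLOIT step is retried and the run still reaches DONE, whereas repeated failures beyond `max_retries` end the run in FAILED instead of looping indefinitely. A real agent would replace each stub with a prompt to the LLM plus tool execution.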
