The rapid advancement of Large Language Models (LLMs) has created new opportunities for Automated Penetration Testing (AutoPT), spawning numerous frameworks aimed at achieving end-to-end autonomous attacks. However, despite the proliferation of related studies, existing research generally lacks systematic architectural analysis and large-scale empirical comparisons under a unified benchmark. Therefore, this paper presents the first Systematization of Knowledge (SoK) focusing on the architectural design and comprehensive empirical evaluation of current LLM-based AutoPT frameworks. At the systematization level, we comprehensively review existing framework designs across six dimensions: agent architecture, agent plan, agent memory, agent execution, external knowledge, and benchmarks. At the empirical level, we conduct large-scale experiments on 13 representative open-source AutoPT frameworks and 2 baseline frameworks using a unified benchmark. The experiments consumed over 10 billion tokens in total and generated more than 1,500 execution logs, which were manually reviewed and analyzed over four months by a panel of more than 15 researchers with expertise in cybersecurity. By investigating the latest progress in this rapidly developing field, we provide researchers with a structured taxonomy for understanding existing LLM-based AutoPT frameworks, a large-scale empirical benchmark, and promising directions for future research.