We introduce HackSynth, a novel Large Language Model (LLM)-based agent capable of autonomous penetration testing. HackSynth's dual-module architecture comprises a Planner and a Summarizer, which enable it to generate commands and process feedback iteratively. To benchmark HackSynth, we propose two new Capture The Flag (CTF)-based benchmark sets built on the popular platforms PicoCTF and OverTheWire. These benchmarks include two hundred challenges across diverse domains and difficulty levels, providing a standardized framework for evaluating LLM-based penetration testing agents. Using these benchmarks, we present extensive experiments analyzing HackSynth's core parameters, including creativity (temperature and top-p) and token utilization. Multiple open-source and proprietary LLMs were used to measure the agent's capabilities. The experiments show that the agent performed best with the GPT-4o model, exceeding what GPT-4o's system card suggests. We also discuss the safety and predictability of HackSynth's actions. Our findings indicate the potential of LLM-based agents in advancing autonomous penetration testing and the importance of robust safeguards. HackSynth and the benchmarks are publicly available to foster research on autonomous cybersecurity solutions.
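To make the dual-module design concrete, here is a minimal sketch of the Planner/Summarizer feedback loop described above. This is not the authors' implementation: the function names (`llm`, `run_command`), the prompts, and the stopping condition are hypothetical stand-ins chosen for illustration.

```python
def hacksynth_loop(llm, run_command, goal, max_steps=10):
    """Illustrative Planner/Summarizer loop (assumed structure, not the
    authors' code). Each iteration: the Planner proposes a shell command,
    the environment executes it, and the Summarizer condenses the output
    into a running context for the next planning step."""
    summary = ""
    for _ in range(max_steps):
        # Planner: propose the next command from the goal and current summary.
        command = llm(f"Goal: {goal}\nKnown so far: {summary}\nNext command:")
        output = run_command(command)
        # Summarizer: fold the new observation into the running summary.
        summary = llm(
            f"Goal: {goal}\nPrevious summary: {summary}\n"
            f"Command: {command}\nOutput: {output}\nUpdated summary:"
        )
        # Hypothetical stop condition: a CTF flag appeared in the output.
        if "flag{" in output:
            return command, output
    return None
```

In practice `llm` would wrap a chat-completion API call and `run_command` a sandboxed shell; the key point is that only the compact summary, not the full command history, is carried between iterations.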