Penetration testing, a critical component of cybersecurity, typically requires extensive time and effort to find vulnerabilities. Beginners in this field often benefit from collaborating with the community or with experts. To provide similar guidance, we develop CIPHER (Cybersecurity Intelligent Penetration-testing Helper for Ethical Researchers), a large language model trained specifically to assist in penetration testing tasks. We train CIPHER on over 300 high-quality write-ups of vulnerable machines, hacking techniques, and documentation of open-source penetration testing tools. We also introduce Findings, Action, Reasoning, and Results (FARR) Flow augmentation, a novel method that augments penetration testing write-ups to establish a fully automated pentesting simulation benchmark tailored for large language models. This approach fills a significant gap left by traditional cybersecurity Q&A benchmarks and provides a realistic and rigorous standard for evaluating an AI model's technical knowledge, reasoning capabilities, and practical utility in dynamic penetration testing scenarios. In our assessments, CIPHER achieved the best overall performance in providing accurate suggestions, outperforming other open-source penetration testing models of similar size and even larger state-of-the-art models such as Llama 3 70B and Qwen1.5 72B Chat, particularly on insane-difficulty machine setups. This indicates that the current capabilities of general LLMs are insufficient for effectively guiding users through the penetration testing process. We also discuss the potential for improvement through scaling and the development of better benchmarks using FARR Flow augmentation results. Our benchmark will be released publicly at https://github.com/ibndias/CIPHER.