Penetration testing, an essential component of software security testing, allows organizations to proactively identify and remediate vulnerabilities in their systems, thus bolstering their defenses against potential cyberattacks. One recent advancement in penetration testing is the use of Large Language Models (LLMs). We explore the intersection of LLMs and penetration testing to gain insight into their capabilities and challenges in the context of privilege escalation. We create an automated Linux privilege-escalation benchmark utilizing local virtual machines. We introduce an LLM-guided privilege-escalation tool designed for evaluating different LLMs and prompt strategies against our benchmark. Our results show that GPT-4 is well suited for detecting file-based exploits, as it can typically solve 75-100% of test cases of that vulnerability class. GPT-3.5-turbo was only able to solve 25-50% of those, while local models, such as Llama2, were not able to detect any exploits. We analyze the impact of different prompt designs, the benefits of in-context learning, and the advantages of offering high-level guidance to LLMs. We discuss challenging areas for LLMs, including maintaining focus during testing and coping with errors, and finally compare them with both stochastic parrots and human hackers.