A Preliminary Study on Using Large Language Models in Software Pentesting

Large language models (LLMs) are perceived to offer promising potential for automating security tasks, such as those found in security operation centers (SOCs). As a first step towards evaluating this perceived potential, we investigate the use of LLMs in software pentesting, where the main task is to automatically identify software security vulnerabilities in source code. We hypothesize that an LLM-based AI agent can be improved over time for a specific security task as human operators interact with it. As a first step, such improvement can be made by engineering the prompts fed to the LLM, based on the responses it produces, to include relevant context and structure so that the model provides more accurate results. Such engineering efforts become sustainable if prompts that are engineered to produce better results on current tasks also produce better results on future, unknown tasks. To examine this hypothesis, we use the OWASP Benchmark Project 1.2, which contains 2,740 hand-crafted source code test cases covering various types of vulnerabilities. We divide the test cases into training and testing data, engineer the prompts based on the training data only, and evaluate the final system on the testing data. We compare the AI agent's performance on the testing data against the performance of the agent without the prompt engineering. We also compare the AI agent's results against those from SonarQube, a widely used static code analyzer for security testing. We built and tested multiple versions of the AI agent using different off-the-shelf LLMs: Google's Gemini-pro, as well as OpenAI's GPT-3.5-Turbo and GPT-4-Turbo (with both chat completion and assistant APIs). The results show that using LLMs is a viable approach to building an AI agent for software pentesting that can improve through repeated use and prompt engineering.
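The evaluation protocol described above (split the test cases, engineer prompts on the training portion only, then score each detector on the held-out portion) can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `split_cases` and `score`, the 50/50 split ratio, the fixed seed, and the choice of true/false positive rates as metrics are all assumptions, since the abstract does not specify them.

```python
import random
from typing import Dict, Iterable, List, Tuple


def split_cases(case_ids: Iterable[str],
                train_fraction: float = 0.5,  # assumed ratio, not from the paper
                seed: int = 0) -> Tuple[List[str], List[str]]:
    """Deterministically shuffle the test case IDs and split them in two."""
    shuffled = sorted(case_ids)            # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def score(labels: Dict[str, bool],
          flagged: Dict[str, bool],
          case_ids: Iterable[str]) -> Dict[str, float]:
    """Compare a detector's verdicts with ground truth on the given cases.

    labels[cid]  is True if the test case really is vulnerable.
    flagged[cid] is True if the detector reported a vulnerability.
    """
    tp = fp = tn = fn = 0
    for cid in case_ids:
        if labels[cid] and flagged[cid]:
            tp += 1
        elif labels[cid]:
            fn += 1
        elif flagged[cid]:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,   # true positive rate
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,   # false positive rate
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```

Under these assumptions, the prompts would be tuned while looking only at the first list returned by `split_cases`, and `score` would then be called on the second list once for each system being compared (the agent with engineered prompts, the agent without them, and the static analyzer), each with its own `flagged` mapping.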
