Prompt engineering reduces reasoning mistakes in Large Language Models (LLMs). However, its effectiveness in mitigating vulnerabilities in LLM-generated code remains underexplored. To address this gap, we implemented a benchmark to automatically assess the impact of various prompt engineering strategies on code security. Our benchmark leverages two peer-reviewed prompt datasets and employs static scanners to evaluate code security at scale. We tested multiple prompt engineering techniques on GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. Our results show that for GPT-4o and GPT-4o-mini, a security-focused prompt prefix can reduce the occurrence of security vulnerabilities by up to 56%. Additionally, all tested models demonstrated the ability to detect and repair between 41.9% and 68.7% of vulnerabilities in previously generated code when using iterative prompting techniques. Finally, we introduce a "prompt agent" that demonstrates how the most effective techniques can be applied in real-world development workflows.
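The two techniques evaluated above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the prefix wording, the `scan` and `generate` callbacks, and the round limit are all hypothetical placeholders.

```python
# Sketch of the two prompt-engineering techniques described in the abstract.
# SECURITY_PREFIX wording and the scan()/generate() callbacks are hypothetical.

SECURITY_PREFIX = (
    "You are a security-conscious developer. Generate code that avoids "
    "common vulnerabilities (e.g., injection, path traversal, weak crypto).\n\n"
)

def secure_prompt(task: str) -> str:
    """Technique 1: prepend a security-focused prefix to the generation prompt."""
    return SECURITY_PREFIX + task

def iterative_repair(code: str, scan, generate, max_rounds: int = 3) -> str:
    """Technique 2: iterative prompting. Feed static-scanner findings back to
    the model and ask for a fixed version until the scan is clean or the
    round budget is exhausted."""
    for _ in range(max_rounds):
        findings = scan(code)  # list of scanner findings, empty if clean
        if not findings:
            break
        repair_prompt = (
            "The following code contains these vulnerabilities:\n"
            + "\n".join(findings)
            + "\n\nCode:\n" + code
            + "\n\nReturn a corrected version of the code."
        )
        code = generate(repair_prompt)
    return code
```

In practice, `scan` would wrap a static scanner such as those used by the benchmark, and `generate` would call the model under test; here they are left as injected callbacks so the control flow of the repair loop is the only thing the sketch commits to.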