In this study, we evaluated the ability of Large Language Models (LLMs), particularly OpenAI's GPT-4, to detect software vulnerabilities, comparing their performance against traditional static code analyzers such as Snyk and Fortify. Our analysis covered numerous repositories, including those from NASA and the Department of Defense. GPT-4 identified approximately four times as many vulnerabilities as its counterparts. Furthermore, it provided viable fixes for each vulnerability, demonstrating a low rate of false positives. Our tests encompassed 129 code samples across eight programming languages, with the most vulnerabilities found in PHP and JavaScript. GPT-4's code corrections led to a 90% reduction in vulnerabilities while requiring only an 11% increase in lines of code. A critical insight was the ability of LLMs to self-audit: they suggested fixes for the vulnerabilities they had identified, underscoring their precision. Future research should explore system-level vulnerabilities and integrate multiple static code analyzers for a holistic perspective on the potential of LLMs.