While several studies have examined the security of code generated by GPT and other Large Language Models (LLMs), most have relied on controlled experiments rather than real developer interactions. This paper investigates the security of GPT-generated code extracted from the DevGPT dataset and evaluates the ability of current LLMs to detect and repair vulnerabilities in this real-world context. We analysed 2,315 C, C++, and C# code snippets using static scanners combined with manual inspection, identifying 56 vulnerabilities across 48 files. These files were then assessed with GPT-4.1, GPT-5, and Claude Opus 4.1 to determine whether they could identify the security issues and, where applicable, specify the corresponding Common Weakness Enumeration (CWE) identifiers and propose fixes. Manual review and re-scanning of the modified code showed that GPT-4.1, GPT-5, and Claude Opus 4.1 correctly detected 46, 44, and 45 vulnerabilities and successfully repaired 42, 44, and 43, respectively. A comparison of experiments conducted in October 2024 and September 2025 indicates substantial progress, with overall detection and remediation rates improving from roughly 50% to around 75-80%. We also observe that LLM-generated code is about as likely to contain vulnerabilities as developer-written code, and that LLMs may confidently provide incorrect information, posing risks for less experienced developers.