This research presents a comprehensive examination of the security of web application code generated by Large Language Models, analyzing a dataset of 2,500 small dynamic PHP websites. These AI-generated sites are deployed as standalone websites in Docker containers and then scanned for security vulnerabilities. The evaluation uses a hybrid methodology that combines the Burp Suite active scanner, static analysis, and manual checks. Our investigation focuses on identifying and analyzing File Upload, SQL Injection, Stored XSS, and Reflected XSS vulnerabilities. This approach not only highlights potential security flaws in AI-generated PHP code but also offers a critical perspective on the reliability and security implications of deploying such code in real-world scenarios. Our evaluation confirms that 27% of the programs generated by GPT-4 verifiably contain vulnerabilities in the PHP code; based on static scanning and manual verification, this number is potentially much higher. This poses a substantial risk to software safety and security. To contribute to the research community and foster further analysis, we have made the source code publicly available, alongside a record enumerating the detected vulnerabilities for each sample. This study not only sheds light on the security aspects of AI-generated code but also underscores the critical need for rigorous testing and evaluation of such technologies before they are used in software development.