Large language models (LLMs) are increasingly used to generate code, but the quality and security of this code are often uncertain. Several recent studies have raised the alarm, indicating that AI-generated code may be particularly vulnerable to cyberattacks. However, most of these studies rely on code generated specifically for the study, which raises questions about the realism of such experiments. In this study, we perform a large-scale empirical analysis of real-life code generated by ChatGPT. We evaluate this code with respect to both correctness and security, and examine the intentions of the users who request code from the model. We further run an experiment to evaluate the effectiveness of common prompt engineering strategies on real-life prompts. Our study corroborates earlier research based on synthetic queries and provides evidence that LLM-generated code is frequently inadequate in terms of security. Additionally, we observe that users rarely ask about the security characteristics of the code they request from LLMs.