Large Language Models (LLMs) such as ChatGPT and GitHub Copilot have revolutionized automated code generation in software engineering. However, as these models are increasingly used for software development, concerns have arisen about the security and quality of the code they generate. These concerns stem from the fact that LLMs are primarily trained on publicly available code repositories and internet-based textual data, which may contain insecure code. This creates a significant risk of perpetuating vulnerabilities in the generated code, opening potential attack vectors for exploitation by malicious actors. Our research addresses these issues by introducing a framework for secure behavioral learning in LLMs through In-Context Learning (ICL) patterns during code generation, followed by rigorous security evaluation. To this end, we selected four diverse LLMs for experimentation, evaluated them across three programming languages, and identified security vulnerabilities and code smells. Code is generated through ICL with curated problem sets and then subjected to rigorous security testing to assess the overall quality and trustworthiness of the output. Our results indicate that ICL-driven one-shot and few-shot learning patterns can enhance code security, reducing vulnerabilities across a variety of programming scenarios. Developers and researchers should be aware that LLMs have a limited understanding of security principles, which can lead to security breaches when generated code is deployed in production systems. Our research also highlights that LLMs are a potential source of new vulnerabilities in the software supply chain, a risk that must be considered when using them for code generation. This article offers insights into improving LLM security and encourages proactive security practices when using LLMs for code generation to keep software systems safe.
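As a concrete illustration of the few-shot ICL pattern referred to above, the sketch below assembles a prompt that pairs an insecure code snippet with its secure counterpart before posing a new task. The demonstration pair, function names, and prompt wording are illustrative assumptions, not the paper's actual curated problem set; the assembled prompt would be sent to whichever LLM is under evaluation.

```python
# A minimal sketch of few-shot ICL prompting for secure code generation.
# The SQL-injection example pair below is a hypothetical demonstration,
# not drawn from the paper's curated problem sets.

FEW_SHOT_EXAMPLES = [
    {
        "task": "Fetch a user record by username.",
        "insecure": (
            "def get_user(conn, username):\n"
            "    # Vulnerable: string interpolation enables SQL injection (CWE-89)\n"
            "    return conn.execute(f\"SELECT * FROM users WHERE name = '{username}'\")\n"
        ),
        "secure": (
            "def get_user(conn, username):\n"
            "    # Safe: parameterized query keeps user data out of the SQL text\n"
            "    return conn.execute(\"SELECT * FROM users WHERE name = ?\", (username,))\n"
        ),
    },
]

def build_secure_codegen_prompt(task_description: str) -> str:
    """Assemble a few-shot prompt that contrasts insecure and secure variants,
    steering the model toward the secure pattern for the new task."""
    parts = [
        "You are a security-aware coding assistant. "
        "Follow the secure examples below.\n"
    ]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Task: {ex['task']}")
        parts.append(f"Insecure solution (do NOT imitate):\n{ex['insecure']}")
        parts.append(f"Secure solution (imitate this style):\n{ex['secure']}")
    parts.append(f"Task: {task_description}")
    parts.append("Secure solution:")
    return "\n".join(parts)

if __name__ == "__main__":
    # In the evaluation pipeline this prompt would go to the LLM under test;
    # here we only print it for inspection.
    print(build_secure_codegen_prompt("Delete a session row by session ID."))
```

The one-shot variant is the same pattern with a single demonstration pair; the zero-shot baseline omits the demonstrations entirely, which is the condition under which the generated code is most likely to reproduce insecure idioms from the training data.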