Code Large Language Models (Code LLMs) are increasingly used by developers to boost productivity, but they often generate vulnerable code. Thus, there is an urgent need to ensure that code generated by Code LLMs is both correct and secure. Previous research has primarily focused on generating secure code, overlooking the fact that secure code also needs to be correct; this oversight can create a false sense of security. Currently, the community lacks a method to measure actual progress in this area, and we need solutions that address both the security and the correctness of code generation. This paper introduces a new benchmark, CodeGuard+, along with two new metrics, to measure Code LLMs' ability to generate code that is both secure and correct. Using our new evaluation methods, we show that the state-of-the-art defense technique, prefix tuning, may not be as strong as previously believed, since it generates secure code at the cost of functional correctness. We also demonstrate that different decoding methods significantly affect the security of Code LLMs. Furthermore, we explore a new defense direction, constrained decoding for secure code generation, and propose new constrained decoding techniques to generate secure code. Our results reveal that constrained decoding is more effective than prefix tuning at improving the security of Code LLMs, without requiring a specialized training dataset. Moreover, our evaluation of eight state-of-the-art Code LLMs shows that constrained decoding substantially improves their security, and our technique outperforms GPT-4.