SOSecure: Safer Code Generation with RAG and StackOverflow Discussions

Large Language Models (LLMs) are widely used for automated code generation. Their reliance on infrequently updated pretraining data leaves them unaware of newly discovered vulnerabilities and evolving security standards, making them prone to producing insecure code. In contrast, developer communities on Stack Overflow (SO) provide an ever-evolving repository of knowledge, where security vulnerabilities are actively discussed and addressed through collective expertise. These community-driven insights remain largely untapped by LLMs. This paper introduces SOSecure, a Retrieval-Augmented Generation (RAG) system that leverages the collective security expertise found in SO discussions to improve the security of LLM-generated code. We build a security-focused knowledge base by extracting SO answers and comments that explicitly identify vulnerabilities. Unlike common uses of RAG, SOSecure triggers after code has been generated, retrieving discussions that identify flaws in similar code. These discussions are then included in a prompt that asks an LLM to consider revising the code. Evaluation across three datasets (SALLM, LLMSecEval, and LMSys) shows that SOSecure achieves strong fix rates of 71.7%, 91.3%, and 96.7%, respectively, compared to prompting GPT-4 without relevant discussions (49.1%, 56.5%, and 37.5%), and outperforms multiple other baselines. SOSecure operates as a language-agnostic complement to existing LLMs, without requiring retraining or fine-tuning, making it easy to deploy. Our results underscore the importance of maintaining active developer forums, whose usage has dropped substantially with LLM adoption.
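The post-generation retrieval step described above can be illustrated with a minimal sketch. It is not the SOSecure implementation: the token-overlap (Jaccard) similarity is an assumed stand-in for the actual retriever, the two knowledge-base entries are invented for illustration rather than taken from real SO posts, and all names (`Discussion`, `retrieve`, `build_revision_prompt`, `TOY_KB`) as well as the prompt wording are hypothetical. The call to the LLM itself is left out.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Discussion:
    """One knowledge-base entry: a code snippet plus a remark that flags a flaw in it."""
    snippet: str
    remark: str


# Invented entries for illustration only; a real knowledge base would hold
# SO answers and comments that explicitly identify vulnerabilities.
TOY_KB = [
    Discussion(
        snippet="hashlib.md5(password.encode()).hexdigest()",
        remark="MD5 is not suitable for password hashing; use a dedicated password hashing function.",
    ),
    Discussion(
        snippet="subprocess.call(cmd, shell=True)",
        remark="shell=True with untrusted input allows command injection.",
    ),
]


def tokenize(code: str) -> Set[str]:
    """Return the set of lowercase identifier-like tokens in a piece of code."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(generated_code: str, kb: List[Discussion], k: int = 1) -> List[Discussion]:
    """Return up to k entries whose snippets share tokens with the generated code."""
    query = tokenize(generated_code)
    scored = [(jaccard(query, tokenize(d.snippet)), d) for d in kb]
    scored = [pair for pair in scored if pair[0] > 0]  # drop entries with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def build_revision_prompt(generated_code: str, discussions: List[Discussion]) -> str:
    """Assemble a prompt asking an LLM to consider revising already-generated code."""
    notes = "\n".join(
        "- Similar code: {}\n  Community remark: {}".format(d.snippet, d.remark)
        for d in discussions
    )
    return (
        "The following code was generated earlier:\n"
        + generated_code
        + "\n\nCommunity discussions about similar code:\n"
        + notes
        + "\n\nConsider whether the code should be revised, and return the final version."
    )


GENERATED = (
    "import hashlib\n"
    "def store(password):\n"
    "    return hashlib.md5(password.encode()).hexdigest()"
)

if __name__ == "__main__":
    hits = retrieve(GENERATED, TOY_KB, k=2)
    print(build_revision_prompt(GENERATED, hits))
```

The ordering is the point of the sketch: retrieval is keyed on the code that was already generated rather than on the user's original request, so the retrieved remarks concern flaws in code resembling the model's actual output.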
