Large Language Models (LLMs) are increasingly deployed for code generation in high-stakes software development, yet their limited transparency in security reasoning and brittleness to evolving vulnerability patterns raise critical trustworthiness concerns. Models trained on static datasets cannot readily adapt to newly discovered vulnerabilities or changing security standards without retraining, leading to the repeated generation of unsafe code. We present a principled approach to trustworthy-by-design code generation that operates as an inference-time safety mechanism. Our approach employs retrieval-augmented generation to surface relevant security risks in generated code and to retrieve related security discussions from a curated Stack Overflow knowledge base; these discussions then guide an LLM as it revises the code. This design emphasizes three aspects of trustworthiness: (1) interpretability, through transparent safety interventions grounded in expert community explanations; (2) robustness, through adaptation to evolving security practices without model retraining; and (3) safety alignment, through real-time intervention before unsafe code reaches deployment. Across real-world and benchmark datasets, our approach improves the security of LLM-generated code over prompting alone, while introducing no new vulnerabilities as measured by static analysis. These results suggest that principled, retrieval-augmented inference-time interventions can serve as a complementary mechanism for improving the safety of LLM-based code generation, and highlight the ongoing value of community knowledge in supporting trustworthy AI deployment.
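To make the pipeline concrete, the sketch below illustrates the retrieval-augmented revision loop described above: a risk detector flags generated code, matching Stack Overflow guidance is retrieved from a curated knowledge base, and the combined evidence is fed back to an LLM for revision. All names here (`KBEntry`, `detect_risks`, `secure_revision`), the toy knowledge base, and the pattern-based detector are illustrative assumptions for this sketch, not the system's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class KBEntry:
    """One curated Stack Overflow post: a vulnerability label plus accepted-answer advice."""
    cwe: str     # vulnerability class discussed, e.g. "CWE-89"
    title: str   # question title
    advice: str  # accepted-answer excerpt used as revision guidance


# Toy stand-in for the curated Stack Overflow knowledge base.
KNOWLEDGE_BASE = [
    KBEntry("CWE-89", "How do I prevent SQL injection in Python?",
            "Use parameterized queries (cursor.execute(sql, params)) "
            "instead of building SQL strings by hand."),
    KBEntry("CWE-78", "Safely running shell commands from Python",
            "Avoid shell=True; pass the command as an argument list to subprocess.run."),
]


def detect_risks(code: str) -> list[str]:
    """Placeholder for a static analyzer that maps code patterns to CWE labels."""
    risks = []
    if "execute(" in code and ("%" in code or 'f"' in code):
        risks.append("CWE-89")   # SQL query built by string formatting
    if "shell=True" in code:
        risks.append("CWE-78")   # OS command injection risk
    return risks


def retrieve(risks: list[str]) -> list[KBEntry]:
    """Retrieve knowledge-base entries whose CWE label matches a detected risk."""
    return [entry for entry in KNOWLEDGE_BASE if entry.cwe in risks]


def build_revision_prompt(code: str, evidence: list[KBEntry]) -> str:
    """Compose the prompt asking the LLM to revise the flagged code."""
    context = "\n".join(f"- [{e.cwe}] {e.title}: {e.advice}" for e in evidence)
    return (
        "The following code may be insecure. Revise it while preserving its "
        "behavior, following this community guidance:\n"
        f"{context}\n\nCode:\n{code}"
    )


def secure_revision(code: str, llm) -> str:
    """One inference-time pass: detect risks, retrieve evidence, revise (or pass through)."""
    risks = detect_risks(code)
    if not risks:
        return code  # nothing flagged, keep the original generation
    prompt = build_revision_prompt(code, retrieve(risks))
    return llm(prompt)  # `llm` is any callable wrapping the code-generation model


if __name__ == "__main__":
    unsafe = 'cursor.execute(f"SELECT * FROM users WHERE name = \'{name}\'")'
    # Echoing the prompt stands in for a real model call in this sketch.
    print(secure_revision(unsafe, llm=lambda prompt: prompt))
```

In this sketch the detector and knowledge base are deliberately minimal; the design point is that both can be swapped for a real static analyzer and an updated Stack Overflow corpus without touching the model, which is what allows adaptation to new vulnerability patterns at inference time.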