Large Language Models (LLMs) can generate code but often introduce security vulnerabilities, logical inconsistencies, and compilation errors. Prior work demonstrates that LLMs benefit substantially from structured feedback, static analysis, retrieval augmentation, and execution-based refinement. We propose a retrieval-augmented, multi-tool repair workflow in which a single code-generating LLM iteratively refines its outputs using compiler diagnostics, CodeQL security scanning, and KLEE symbolic execution. A lightweight embedding model performs semantic retrieval of previously successful repairs, supplying security-focused examples that guide generation. Evaluated on a combined dataset of 3,242 programs generated by DeepSeek-Coder-1.3B and CodeLlama-7B, the system demonstrates significant improvements in robustness. For DeepSeek, security vulnerabilities were reduced by 96%. For the larger CodeLlama model, the critical security defect rate fell from 58.55% to 22.19%, highlighting the efficacy of tool-assisted self-repair even on "stubborn" models.
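The repair workflow described above can be sketched as a simple loop: run the analysis tools, retrieve past repairs whose diagnostics are semantically similar, and ask the model for a fix conditioned on those examples. The sketch below is illustrative only; `run_tools`, `generate_fix`, the repair bank, and the bag-of-words embedding are hypothetical stand-ins for the paper's actual tool drivers (compiler, CodeQL, KLEE) and its lightweight neural embedding model.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding; a stand-in for the lightweight
    # embedding model used for semantic retrieval in the paper.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_repairs(diagnostic, repair_bank, k=2):
    # Return the k past repairs whose stored diagnostics are most
    # similar to the current diagnostic message.
    q = embed(diagnostic)
    ranked = sorted(repair_bank,
                    key=lambda r: cosine(q, embed(r["diagnostic"])),
                    reverse=True)
    return ranked[:k]

def repair_loop(code, run_tools, generate_fix, repair_bank, max_iters=3):
    # Iteratively refine `code` until no tool reports a defect or the
    # iteration budget is exhausted. `run_tools` plays the role of the
    # compiler/CodeQL/KLEE stage; `generate_fix` plays the role of the
    # code-generating LLM prompted with retrieved examples.
    for _ in range(max_iters):
        diagnostics = run_tools(code)
        if not diagnostics:
            return code, True
        examples = retrieve_repairs(diagnostics[0], repair_bank)
        code = generate_fix(code, diagnostics, examples)
    return code, not run_tools(code)

if __name__ == "__main__":
    # Minimal simulated run: the "tools" flag an unbounded strcpy until
    # the "model" patches it with a bounded copy.
    bank = [
        {"diagnostic": "buffer overflow in strcpy", "patch": "use bounded copy"},
        {"diagnostic": "null pointer dereference", "patch": "add null check"},
    ]
    run_tools = lambda c: [] if "strncpy" in c else ["buffer overflow in strcpy"]
    generate_fix = lambda c, d, ex: c.replace(
        "strcpy(dst, src)", "strncpy(dst, src, sizeof dst)")
    fixed, ok = repair_loop("strcpy(dst, src);", run_tools, generate_fix, bank)
    print(ok)  # prints True once the simulated vulnerability is patched
```

In the real system the repair bank would grow as the loop succeeds, so later generations are guided by security-focused examples drawn from earlier successful repairs.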