As large language models (LLMs) are increasingly used for code generation, concerns over the associated security risks have grown substantially. Early research has primarily focused on red teaming, which aims to uncover and evaluate the vulnerabilities and risks of CodeGen models. However, progress on the blue teaming side remains limited, as developing effective defenses requires the semantic understanding needed to differentiate unsafe inputs from safe ones. To fill this gap, we propose BlueCodeAgent, an end-to-end blue teaming agent enabled by automated red teaming. Our framework integrates both sides: red teaming generates diverse risky instances, while the blue teaming agent leverages these to detect both previously seen and unseen risk scenarios through constitution and code analysis, with agentic integration for multi-level defense. Our evaluation across three representative code-related tasks--bias instruction detection, malicious instruction detection, and vulnerable code detection--shows that BlueCodeAgent achieves significant gains over the base models and safety prompt-based defenses. In particular, for vulnerable code detection, BlueCodeAgent integrates dynamic analysis to effectively reduce false positives, a challenging problem because base models tend to be over-conservative, misclassifying safe code as unsafe. Overall, BlueCodeAgent achieves an average 12.7\% F1 score improvement across four datasets spanning the three tasks, attributed to its ability to summarize actionable constitutions that enhance context-aware risk detection. We demonstrate that red teaming benefits blue teaming by continuously identifying new vulnerabilities to enhance defense performance.