The dual offensive and defensive utility of Large Language Models (LLMs) highlights a critical gap in AI security: the lack of unified frameworks for dynamic, iterative adversarial hardening. To bridge this gap, we propose the Red Team vs. Blue Team (RvB) framework, formulated as a training-free, sequential, imperfect-information game. In this process, the Red Team exposes vulnerabilities, driving the Blue Team to learn effective defenses without parameter updates. We validate our framework across two challenging domains: dynamic code hardening against CVEs and guardrail optimization against jailbreaks. Our empirical results show that this interaction compels the Blue Team to learn fundamental defensive principles, yielding robust remediations that are not merely overfitted to specific exploits. RvB achieves Defense Success Rates of 90\% and 45\% on the respective tasks while maintaining near-0\% False Positive Rates, significantly surpassing baselines. This work establishes iterative adversarial interaction as a practical paradigm for automating the continuous hardening of AI systems.
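To make the game structure concrete, the following is a minimal sketch of the training-free RvB loop described above. All names here (rvb_hardening, red_team, blue_team, verifier) are hypothetical illustrations, not the paper's actual interfaces; the sketch shows only the sequential interaction in which exploits and remediations accumulate in a shared transcript while no model parameters are updated.

\begin{verbatim}
# Minimal sketch of the RvB loop (hypothetical names throughout:
# red_team, blue_team, and verifier are illustrative stand-ins,
# not the paper's actual implementation).
def rvb_hardening(artifact, red_team, blue_team, verifier, max_rounds=5):
    """Iteratively harden `artifact` (e.g., CVE-prone code or a guardrail).

    Training-free: no parameters are updated; the Blue Team adapts
    only through the attack/remediation transcript in its context.
    """
    transcript = []  # imperfect information: each side observes only this
                     # public record, not the opponent's internal strategy
    for _ in range(max_rounds):
        exploit = red_team.attack(artifact, transcript)
        if not verifier.exploit_succeeds(exploit, artifact):
            break  # Red can no longer break the artifact: hardened
        artifact = blue_team.remediate(artifact, exploit, transcript)
        transcript.append((exploit, artifact))
    return artifact, transcript
\end{verbatim}

The termination condition reflects the sequential-game formulation: play ends when the Red Team fails to produce a working exploit or the round budget is exhausted.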