Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications, but concerns about their security persist, particularly their vulnerability to jailbreak attacks. Drawing inspiration from adversarial training in deep learning and from LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG), which defends against jailbreaks without the need for fine-tuning. ICAG leverages agent learning to conduct an adversarial game that dynamically extends the knowledge used to defend against jailbreaks. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process that improves both the defense agent and the attack agent. This continuous improvement strengthens defenses against newly generated jailbreak prompts. Our empirical studies confirm ICAG's efficacy: LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates across various attack scenarios. Moreover, ICAG demonstrates remarkable transferability to other LLMs, indicating its potential as a versatile defense mechanism.
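The iterative attack-and-defense loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `GameState`, `run_adversarial_game`, `TRICKS`, and the three callbacks are hypothetical, and the callbacks are deterministic stand-ins for what would presumably be LLM calls. The sketch only shows the general shape of the idea, in which the defense accumulates in-context insights rather than updating model weights, while the attacker rewrites prompts that the updated defense blocks.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical attack styles used only by this toy example.
TRICKS = ["roleplay", "encoding", "hypothetical"]


@dataclass
class GameState:
    attack_prompts: List[str]
    # Insights would be placed in the defended model's context; no fine-tuning.
    defense_insights: List[str] = field(default_factory=list)


def run_adversarial_game(
    state: GameState,
    is_jailbroken: Callable[[str, List[str]], bool],
    extract_insight: Callable[[str], str],
    refine_attack: Callable[[str], str],
    rounds: int = 5,
) -> List[float]:
    """Alternate defense and attack updates; return the success rate per round."""
    success_rates = []
    for _ in range(rounds):
        # 1) Evaluate every attack prompt against the current defense.
        successes = [p for p in state.attack_prompts
                     if is_jailbroken(p, state.defense_insights)]
        success_rates.append(len(successes) / max(len(state.attack_prompts), 1))

        # 2) Defense step: record one insight per successful attack.
        for prompt in successes:
            insight = extract_insight(prompt)
            if insight not in state.defense_insights:
                state.defense_insights.append(insight)

        # 3) Attack step: rewrite prompts that the updated defense now blocks.
        state.attack_prompts = [
            p if is_jailbroken(p, state.defense_insights) else refine_attack(p)
            for p in state.attack_prompts
        ]
    return success_rates


# Toy stand-ins (assumptions): a prompt is "trick:payload", and the defense
# blocks a prompt once it holds an insight naming that trick.
def toy_is_jailbroken(prompt: str, insights: List[str]) -> bool:
    return prompt.split(":", 1)[0] not in insights


def toy_extract_insight(prompt: str) -> str:
    return prompt.split(":", 1)[0]


def toy_refine_attack(prompt: str) -> str:
    trick, payload = prompt.split(":", 1)
    next_trick = TRICKS[(TRICKS.index(trick) + 1) % len(TRICKS)]
    return f"{next_trick}:{payload}"


state = GameState(attack_prompts=["roleplay:x", "roleplay:y"])
rates = run_adversarial_game(
    state, toy_is_jailbroken, toy_extract_insight, toy_refine_attack
)
print(rates)
print(state.defense_insights)
```

In this toy setting the attacker keeps succeeding until the defense has seen all three tricks, after which the success rate drops to zero. A real system would face an open-ended space of attacks, so no such clean convergence should be assumed.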