Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications. However, concerns about their security, particularly their vulnerability to jailbreak attacks, persist. Drawing inspiration from adversarial training in deep learning and LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG), which defends against jailbreaks without fine-tuning. ICAG leverages agent learning to conduct an adversarial game that dynamically extends knowledge for defending against jailbreaks. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents; this continuous improvement strengthens defenses against newly generated jailbreak prompts. Our empirical studies affirm ICAG's efficacy: LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates across various attack scenarios. Moreover, ICAG demonstrates strong transferability to other LLMs, indicating its potential as a versatile defense mechanism.
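To make the iterative attack/defense loop concrete, here is a minimal sketch of one round of such an in-context adversarial game. This is an illustration under assumptions, not the paper's actual procedure: the callables `attack_agent`, `target_llm`, `judge`, and `defense_agent` are hypothetical placeholders, and the in-context update is assumed to append distilled defensive insights to the target's system prompt rather than fine-tune any weights.

```python
from typing import Callable, List

def icag_round(
    attack_agent: Callable[[str, List[str]], str],  # crafts a jailbreak prompt from a goal and past failures
    target_llm: Callable[[str, str], str],          # (system_prompt, user_prompt) -> response
    judge: Callable[[str], bool],                   # True if the response constitutes a successful jailbreak
    defense_agent: Callable[[str, str], str],       # distills a defensive insight from (prompt, response)
    system_prompt: str,
    goals: List[str],
    failed_attacks: List[str],
) -> str:
    """One round of the adversarial game: successful attacks teach the defense;
    the updated system prompt is returned for the next round."""
    for goal in goals:
        prompt = attack_agent(goal, failed_attacks)
        response = target_llm(system_prompt, prompt)
        if judge(response):
            # Successful jailbreak: fold a new defensive insight into the
            # system prompt (an in-context update, no fine-tuning involved).
            system_prompt += "\n" + defense_agent(prompt, response)
        else:
            # Failed attack: recorded so the attack agent can refine its strategy,
            # strengthening both sides of the game over successive rounds.
            failed_attacks.append(prompt)
    return system_prompt
```

In this sketch, repeated calls to `icag_round` realize the continuous improvement loop: the defense accumulates insights against newly generated jailbreak prompts, while the attack agent adapts to the hardened defense.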