Large language models (LLMs) have revolutionized natural language processing, yet they can be attacked to produce harmful content. Despite efforts to ethically align LLMs, these alignments are often fragile and can be circumvented by jailbreaking attacks that use optimized or manually crafted adversarial prompts. To address this, we introduce the Information Bottleneck Protector (IBProtector), a defense mechanism grounded in the information bottleneck principle, with an objective modified to avoid trivial solutions. Using a lightweight, trainable extractor, IBProtector selectively compresses and perturbs prompts, preserving only the information essential for the target LLM to respond with the expected answer. We further consider the setting where the target model's gradients are not accessible, so that the defense remains compatible with any LLM. Our empirical evaluations show that IBProtector outperforms current defense methods in mitigating jailbreak attempts without overly affecting response quality or inference speed. Its effectiveness and adaptability across various attack methods and target LLMs underscore its potential as a novel, transferable defense that bolsters LLM security without requiring modifications to the underlying models.
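The extract-then-perturb idea can be illustrated with a minimal sketch. The abstract does not specify the extractor's architecture, the perturbation rule, or the exact form of the modified objective, so everything below is an assumption made for illustration: the function names `perturb_prompt` and `compactness_penalty`, the filler token `"."`, the hand-set per-token keep scores (standing in for the output of a trained extractor), and the Bernoulli KL term used as a stand-in compression regularizer.

```python
import math
import random


def perturb_prompt(tokens, keep_probs, filler=".", threshold=0.5, rng=None):
    """Keep tokens the (hypothetical) extractor scores as informative and
    replace the rest with a neutral filler token.

    If `rng` is None, a token is kept when its score reaches `threshold`;
    otherwise each token is kept with probability equal to its score.
    """
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    out = []
    for tok, p in zip(tokens, keep_probs):
        keep = (p >= threshold) if rng is None else (rng.random() < p)
        out.append(tok if keep else filler)
    return out


def compactness_penalty(keep_probs, target_rate=0.5):
    """Mean KL(Bernoulli(p) || Bernoulli(target_rate)) over tokens.

    An assumed regularizer: it is zero when every keep score equals
    `target_rate` and grows as the mask drifts away from that rate.
    """
    eps = 1e-8
    total = 0.0
    for p in keep_probs:
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += p * math.log(p / target_rate)
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - target_rate))
    return total / len(keep_probs)


# Hand-set scores for illustration only; a real extractor would predict them.
tokens = ["Explain", "how", "photosynthesis", "works", "xq!!", "zz##"]
keep_probs = [0.9, 0.8, 0.95, 0.9, 0.1, 0.05]

masked = perturb_prompt(tokens, keep_probs)
sampled = perturb_prompt(tokens, keep_probs, rng=random.Random(0))
print(" ".join(masked))
print(round(compactness_penalty(keep_probs), 4))
```

In this toy setup, the low-scoring trailing tokens play the role of an adversarial suffix and are replaced before the prompt would reach the target LLM. In a trained system the keep scores would be learned so that the target model still produces the expected answer, with a compression term of some kind limiting how much of the prompt survives.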