BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger

Multimodal Large Language Models (MLLMs) have shown impressive performance across a variety of multimodal tasks. However, integrating an additional image modality may allow malicious users to inject harmful content into images for jailbreaking. Unlike text-based LLMs, where adversaries must use specific algorithms to select discrete tokens that conceal their malicious intent, the continuous nature of image signals gives adversaries a direct opportunity to inject harmful intent. In this work, we propose $\textbf{BaThe}$ ($\textbf{Ba}$ckdoor $\textbf{T}$rigger S$\textbf{h}$i$\textbf{e}$ld), a simple yet effective jailbreak defense mechanism. Our work is motivated by recent research on jailbreak backdoor attacks and virtual prompt backdoor attacks in generative language models. A jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses. We assume that harmful instructions themselves can function as triggers, and that if we instead set rejection responses as the triggered response, the backdoored model can defend against jailbreak attacks. We achieve this with a virtual rejection prompt, similar to the virtual prompt used in virtual prompt backdoor attacks. We embed the virtual rejection prompt into soft text embeddings, which we call the "wedge". Our comprehensive experiments demonstrate that BaThe effectively mitigates various types of jailbreak attacks and adapts to defend against unseen attacks, with minimal impact on MLLMs' performance.
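
The idea of a soft-embedding "wedge" paired with rejection targets can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_input_sequence` and `make_target`, the placement of the wedge between the image and text embeddings, and the rejection string are all hypothetical choices made for illustration. The abstract only states that the virtual rejection prompt is embedded into soft text embeddings.

```python
from typing import List

Vector = List[float]

# Hypothetical rejection text; the abstract does not give the actual wording.
REJECTION = "I'm sorry, but I cannot help with that."


def build_input_sequence(
    image_embeds: List[Vector],
    wedge: List[Vector],
    text_embeds: List[Vector],
) -> List[Vector]:
    """Concatenate [image | wedge | text] along the sequence axis.

    The wedge is a small set of trainable soft embeddings. Placing it
    between the image and text embeddings is an assumption of this sketch.
    """
    if not wedge:
        raise ValueError("wedge must contain at least one vector")
    dim = len(wedge[0])
    for name, block in (("image", image_embeds), ("wedge", wedge), ("text", text_embeds)):
        for vec in block:
            if len(vec) != dim:
                raise ValueError(f"{name} embedding has dimension {len(vec)}, expected {dim}")
    return image_embeds + wedge + text_embeds


def make_target(is_harmful: bool, benign_response: str, rejection: str = REJECTION) -> str:
    """Choose the training target for one example.

    Harmful instructions act as the trigger and map to a rejection;
    benign instructions keep their ordinary response.
    """
    return rejection if is_harmful else benign_response
```

In a setup like this, only the wedge vectors would be updated during training while the MLLM stays frozen, which is one plausible reading of how such a defense could keep its impact on ordinary performance small.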
