Despite the remarkable versatility of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) in generalizing across language and vision tasks, both remain vulnerable to jailbreaking: when exposed to harmful or sensitive inputs, they can generate textual outputs that violate safety, ethical, and fairness standards. With recent advances in safety alignment via preference tuning from human feedback, LLMs and MLLMs have been equipped with guardrails that yield safe, ethical, and fair responses to harmful inputs. However, despite the significance of safety alignment, its vulnerabilities remain largely underexplored. In this paper, we investigate one such unexplored vulnerability, examining whether safety alignment can consistently provide safety guarantees for out-of-distribution (OOD)-ified harmful inputs that fall outside the aligned data distribution. Our key observation is that OOD-ifying vanilla harmful inputs substantially increases the model's uncertainty in discerning the malicious intent within the input, raising the chance of a successful jailbreak. Exploiting this vulnerability, we propose JOOD, a new Jailbreak framework that OOD-ifies inputs beyond the safety alignment. We explore various off-the-shelf visual and textual transformation techniques for OOD-ifying harmful inputs. Notably, even simple mixing-based techniques such as image mixup prove highly effective at increasing the model's uncertainty, thereby facilitating the bypass of safety alignment. Experiments across diverse jailbreak scenarios demonstrate that JOOD effectively jailbreaks recent proprietary LLMs and MLLMs such as GPT-4 and o1 with high attack success rates, models that previous attack approaches have consistently struggled to jailbreak. Code is available at https://github.com/naver-ai/JOOD.
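The image mixup mentioned above can be illustrated with a minimal sketch: a pixel-wise convex combination of two images, `lam * img_a + (1 - lam) * img_b`. This is a generic mixup implementation for intuition only, not the paper's actual pipeline; the function name, the synthetic arrays, and the choice of `lam = 0.5` are illustrative assumptions.

```python
import numpy as np

def mixup_images(img_a, img_b, lam=0.5):
    """Blend two same-shape uint8 images pixel-wise:
    lam * img_a + (1 - lam) * img_b, clipped back to [0, 255]."""
    a = img_a.astype(np.float32)
    b = img_b.astype(np.float32)
    mixed = lam * a + (1.0 - lam) * b
    return np.clip(mixed, 0, 255).astype(np.uint8)

# Two synthetic 4x4 grayscale "images" standing in for a pair of inputs
# (in the paper's setting, one would carry the harmful content).
img_a = np.full((4, 4), 200, dtype=np.uint8)
img_b = np.full((4, 4), 100, dtype=np.uint8)
mixed = mixup_images(img_a, img_b, lam=0.5)
print(mixed[0, 0])  # 150
```

The intuition is that the blended image lies off the distribution the safety tuning was trained on, while the underlying content can still be partially recognizable to the model.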