The deployment of multimodal large language models (MLLMs) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. This paper investigates the novel challenge of defending MLLMs against such attacks. Compared to large language models (LLMs), MLLMs include an additional image modality. We discover that images act as a ``foreign language'' that is not considered during safety alignment, making MLLMs more prone to producing harmful responses. Unfortunately, unlike the discrete tokens handled by text-based LLMs, the continuous nature of image signals presents significant alignment challenges, making it difficult to thoroughly cover all possible scenarios. This vulnerability is exacerbated by the fact that most state-of-the-art MLLMs are fine-tuned on a limited set of image-text pairs that is far smaller than the extensive text-based pretraining corpus, which makes them more prone to catastrophic forgetting of their original abilities during safety fine-tuning. To tackle these challenges, we introduce MLLM-Protector, a plug-and-play strategy that solves two subtasks: 1) identifying harmful responses via a lightweight harm detector, and 2) transforming harmful responses into harmless ones via a detoxifier. This approach effectively mitigates the risks posed by malicious visual inputs without compromising the original performance of MLLMs. Our results demonstrate that MLLM-Protector offers a robust solution to a previously unaddressed aspect of MLLM security.
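To make the plug-and-play design concrete, below is a minimal sketch of how such a post-hoc pipeline could wrap a frozen MLLM. All names here (`HarmDetector`, `Detoxifier`, `protected_generate`) are hypothetical stand-ins for illustration, not the paper's actual implementation, and the placeholder logic inside each component is an assumption.

```python
from typing import Callable

# A minimal sketch of the plug-and-play pipeline, assuming hypothetical
# component names; the placeholder logic is illustrative only.

class HarmDetector:
    """Lightweight classifier that flags harmful candidate responses."""
    def is_harmful(self, prompt: str, response: str) -> bool:
        # Placeholder keyword heuristic; in the paper this role is played
        # by a trained lightweight model.
        banned = ("explosive", "poison")
        return any(word in response.lower() for word in banned)

class Detoxifier:
    """Rewrites a flagged response into a harmless alternative."""
    def rewrite(self, prompt: str, response: str) -> str:
        # Placeholder rewrite; the paper uses a generative detoxifier.
        return "I'm sorry, but I can't help with that request."

def protected_generate(
    mllm_generate: Callable[[bytes, str], str],
    detector: HarmDetector,
    detoxifier: Detoxifier,
    image: bytes,
    prompt: str,
) -> str:
    # 1) The frozen MLLM answers as usual; its weights are untouched, so no
    #    safety fine-tuning is needed and original abilities are preserved.
    response = mllm_generate(image, prompt)
    # 2) Harm detection operates on the output text, sidestepping the
    #    continuous image space that is hard to cover during alignment.
    if detector.is_harmful(prompt, response):
        # 3) The detoxifier transforms the harmful response into a harmless one.
        response = detoxifier.rewrite(prompt, response)
    return response
```

Because the wrapper only inspects and, when necessary, rewrites the generated text, it can be attached to any MLLM without modifying its weights, which is what preserves the model's original performance.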