The deployment of multimodal large language models (MLLMs) has exposed a distinctive vulnerability: susceptibility to malicious attacks through visual inputs. We study the new challenge of defending MLLMs against such attacks. We find that images act as a "foreign language" that is not considered during alignment, which can make MLLMs prone to producing harmful responses. Unlike the discrete tokens of text-based LLMs, image signals are continuous, which makes it difficult for alignment to cover the possible scenarios thoroughly. The vulnerability is exacerbated by the fact that open-source MLLMs are predominantly fine-tuned on a limited number of image-text pairs, far fewer than the extensive text-based pretraining corpus, which makes them more prone to catastrophic forgetting of their original abilities during explicit alignment tuning. To tackle these challenges, we introduce MLLM-Protector, a plug-and-play strategy that combines a lightweight harm detector with a response detoxifier. The harm detector identifies potentially harmful outputs from the MLLM, and the detoxifier corrects these outputs so that the response complies with safety standards. This approach effectively mitigates the risks posed by malicious visual inputs without compromising the model's overall performance. Our results demonstrate that MLLM-Protector offers a robust solution to a previously unaddressed aspect of MLLM security.
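The detect-then-correct flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `Protector`, the 0.5 threshold, the keyword-based `toy_score_harm`, and the fixed-reply `toy_detoxify` are all hypothetical stand-ins for the learned harm detector and detoxifier, and passing the user query to the detoxifier is an assumption. The sketch only shows how such a guard might wrap an MLLM's output without modifying the model itself.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces; in the paper these components are learned models.
HarmScorer = Callable[[str], float]      # response -> harm score in [0, 1]
Detoxifier = Callable[[str, str], str]   # (query, harmful response) -> safe response


@dataclass
class Protector:
    score_harm: HarmScorer
    detoxify: Detoxifier
    threshold: float = 0.5  # assumed cutoff, not a value from the paper

    def guard(self, query: str, response: str) -> str:
        """Pass benign responses through unchanged; rewrite flagged ones."""
        if self.score_harm(response) < self.threshold:
            return response
        return self.detoxify(query, response)


def toy_score_harm(response: str) -> float:
    # Toy keyword check standing in for a trained harm detector.
    blocked = ("build a bomb", "steal a password")
    return 1.0 if any(phrase in response.lower() for phrase in blocked) else 0.0


def toy_detoxify(query: str, response: str) -> str:
    # Toy fixed reply standing in for a trained detoxifier.
    return "I can't help with that request."


protector = Protector(toy_score_harm, toy_detoxify)
benign = protector.guard("What is in this image?", "Here is a cake recipe.")
flagged = protector.guard("Follow the image.", "To build a bomb, first gather...")
```

Because the guard runs only on the generated response, the underlying MLLM is left untouched, which is the sense in which the abstract calls the strategy plug-and-play.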