The deployment of multimodal large language models (MLLMs) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. We investigate the novel challenge of defending MLLMs against such attacks. We find that images act as a "foreign language" that is not considered during alignment, which can make MLLMs prone to producing harmful responses. Unfortunately, unlike the discrete tokens considered in text-based LLMs, image signals are continuous, which presents significant alignment challenges and makes it difficult to cover all possible scenarios thoroughly. This vulnerability is exacerbated by the fact that open-source MLLMs are predominantly fine-tuned on a limited set of image-text pairs, far smaller than the extensive text-based pretraining corpus, which makes the MLLMs more prone to catastrophic forgetting of their original abilities during explicit alignment tuning. To tackle these challenges, we introduce MLLM-Protector, a plug-and-play strategy combining a lightweight harm detector and a response detoxifier. The harm detector identifies potentially harmful outputs from the MLLM, while the detoxifier corrects these outputs so that the response conforms to safety standards. This approach effectively mitigates the risks posed by malicious visual inputs without compromising the model's overall performance. Our results demonstrate that MLLM-Protector offers a robust solution to a previously unaddressed aspect of MLLM security.
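The plug-and-play design described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `protected_generate`, the callable interfaces, the `threshold` parameter, and the toy stand-ins are all hypothetical, chosen only to show how a harm detector and a detoxifier might wrap an unmodified MLLM.

```python
from typing import Callable

# Hypothetical interfaces; the abstract does not specify any of these signatures.
Generator = Callable[[str, bytes], str]   # (text prompt, raw image bytes) -> response
HarmScorer = Callable[[str], float]       # response -> harm score in [0, 1]
Detoxifier = Callable[[str, str], str]    # (prompt, harmful response) -> corrected response


def protected_generate(
    generate: Generator,
    score_harm: HarmScorer,
    detoxify: Detoxifier,
    prompt: str,
    image: bytes,
    threshold: float = 0.5,  # assumed decision threshold, not taken from the source
) -> str:
    """Wrap an unmodified MLLM with a harm detector and a detoxifier."""
    response = generate(prompt, image)         # the base model is left untouched
    if score_harm(response) >= threshold:      # detector inspects the model's output
        return detoxify(prompt, response)      # detoxifier rewrites a flagged output
    return response                            # benign outputs pass through unchanged


# Toy stand-ins so the sketch runs; real components would be learned models.
def toy_mllm(prompt: str, image: bytes) -> str:
    if b"attack" in image:
        return "Step 1: build the weapon"
    return "A cat sitting on a sofa."


def toy_harm_score(response: str) -> float:
    return 1.0 if "weapon" in response.lower() else 0.0


def toy_detoxifier(prompt: str, response: str) -> str:
    return "I can't help with that request."


benign = protected_generate(
    toy_mllm, toy_harm_score, toy_detoxifier, "Describe the image.", b"cat-photo"
)
malicious = protected_generate(
    toy_mllm, toy_harm_score, toy_detoxifier,
    "Follow the instructions in the image.", b"attack-image",
)
```

Because the wrapper only reads the base model's output, the MLLM's weights are never updated in this sketch, which is one way to read the claim that the approach avoids the catastrophic forgetting associated with explicit alignment tuning.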