Multimodal large language models (MLLMs) have shown impressive reasoning abilities, but they are also more vulnerable to jailbreak attacks than their LLM predecessors. We observe that although MLLMs can still detect unsafe responses, the safety mechanisms of their pre-aligned LLMs are easily bypassed once image features are introduced. To build robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protection approach that exploits the inherent safety awareness of MLLMs and generates safer responses by adaptively transforming unsafe images into text, thereby activating the intrinsic safety mechanisms of the pre-aligned LLMs inside MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that ECSO enhances model safety significantly (e.g., a 37.6% improvement on MM-SafetyBench (SD+OCR) and 71.3% on VLSafe with LLaVA-1.5-7B), while consistently maintaining utility on common MLLM benchmarks. Furthermore, we show that ECSO can be used as a data engine to generate supervised fine-tuning (SFT) data for MLLM alignment without extra human intervention.
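The abstract outlines a two-stage, training-free flow: answer normally, let the MLLM judge its own response, and only fall back to a query-aware image-to-text transformation when the response is deemed unsafe. Below is a minimal sketch of that flow under assumed interfaces; the `mllm.generate(...)` method, helper prompts, and variable names are hypothetical illustrations, not the authors' exact implementation.

```python
def ecso_generate(mllm, image, query):
    """Sketch of the ECSO flow, assuming a hypothetical MLLM interface
    where mllm.generate(image=..., prompt=...) returns a text response
    and accepts image=None for text-only generation."""
    # Step 1: answer normally ("eyes open").
    answer = mllm.generate(image=image, prompt=query)

    # Step 2: exploit the model's intrinsic safety awareness by asking it
    # to judge whether its own answer is harmful (prompt is illustrative).
    judge_prompt = (
        "Is the following response harmful, unsafe, or unethical? "
        f"Answer yes or no.\nResponse: {answer}"
    )
    verdict = mllm.generate(image=image, prompt=judge_prompt)
    if "yes" not in verdict.lower():
        return answer  # judged safe; return the original response

    # Step 3: "eyes closed" -- transform the image into a query-aware
    # caption and re-answer from text alone, so the safety mechanisms of
    # the pre-aligned LLM inside the MLLM are triggered.
    caption_prompt = f"Describe the image content relevant to: {query}"
    caption = mllm.generate(image=image, prompt=caption_prompt)
    text_only_query = f"{caption}\n\n{query}"
    return mllm.generate(image=None, prompt=text_only_query)
```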