Vision-language alignment in Large Vision-Language Models (LVLMs) successfully enables LLMs to understand visual input. However, we find that existing vision-language alignment methods fail to transfer the safety mechanism that LLMs already possess for text to the visual modality, which leaves LVLMs vulnerable to toxic images. To explore the cause of this problem, we provide an insightful explanation of where and how the safety mechanism of LVLMs operates and conduct a comparative analysis between text and vision. We find that the hidden states at specific transformer layers play a crucial role in successfully activating the safety mechanism, while the vision-language alignment at the hidden-state level in current methods is insufficient. This causes input images to undergo a semantic shift in the hidden states relative to text, which in turn misleads the safety mechanism. To address this, we propose a novel Text-Guided vision-language Alignment method (TGA) for LVLMs. TGA retrieves texts related to the input image and uses them to guide the projection of vision into the hidden-state space of the LLM. Experiments show that TGA not only successfully transfers the safety mechanism for text in basic LLMs to vision during vision-language alignment for LVLMs, without any safety fine-tuning on the visual modality, but also maintains general performance on various vision tasks (Safe and Good).
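To make the core idea concrete, the following is a minimal sketch (not the authors' released implementation) of text-guided alignment at the hidden-state level: visual features are projected into the LLM's hidden-state space and pulled toward the hidden states of a retrieved caption at a chosen transformer layer. All module and variable names (e.g. `TextGuidedProjector`, `alignment_loss`, `llm_hidden_states`) are hypothetical.

```python
# Sketch of text-guided vision-language alignment; names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextGuidedProjector(nn.Module):
    """Maps vision-encoder features into the LLM hidden-state space."""

    def __init__(self, vision_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        return self.proj(vision_feats)


def alignment_loss(vision_hidden: torch.Tensor,
                   text_hidden: torch.Tensor) -> torch.Tensor:
    """Pull pooled visual hidden states toward the hidden states of the
    retrieved caption, so both modalities land in a similar region of the
    hidden-state space (cosine distance chosen here for illustration)."""
    v = vision_hidden.mean(dim=1)  # (batch, hidden_dim)
    t = text_hidden.mean(dim=1)    # (batch, hidden_dim)
    return 1.0 - F.cosine_similarity(v, t, dim=-1).mean()


# Usage sketch (pseudo-calls, assumed helpers):
# vision_feats  = vision_encoder(image)                     # frozen vision encoder
# vision_tokens = projector(vision_feats)                   # projected visual tokens
# text_hidden   = llm_hidden_states(retrieved_caption, L)   # caption hidden states at layer L
# vision_hidden = llm_hidden_states_from_embeds(vision_tokens, L)
# loss = task_loss + lambda_align * alignment_loss(vision_hidden, text_hidden)
```

The design choice illustrated here is that alignment is enforced on intermediate hidden states of the LLM rather than only on the input embeddings, which is where the abstract locates the safety-relevant representations.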