Vision-language alignment in Large Vision-Language Models (LVLMs) successfully enables LLMs to understand visual input. However, we find that existing vision-language alignment methods fail to transfer the safety mechanism for text in LLMs to vision, which leads to vulnerabilities to toxic images. To explore the cause of this problem, we provide an insightful explanation of where and how the safety mechanism of LVLMs operates, and conduct a comparative analysis between text and vision. We find that the hidden states at specific transformer layers play a crucial role in the successful activation of the safety mechanism, while the vision-language alignment at the hidden-state level in current methods is insufficient. This results in a semantic shift for input images relative to text in the hidden states, which in turn misleads the safety mechanism. To address this, we propose a novel Text-Guided vision-language Alignment method (TGA) for LVLMs. TGA retrieves texts related to the input image and uses them to guide the projection of vision into the hidden-state space of the LLM. Experiments show that TGA not only successfully transfers the safety mechanism for text in basic LLMs to vision in vision-language alignment for LVLMs without any safety fine-tuning on the visual modality, but also maintains general performance on various vision tasks (Safe and Good).
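The core idea described above, pulling vision hidden states toward the hidden states of retrieved, semantically matching texts so that both modalities occupy a shared space at the layers where the safety mechanism activates, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual objective: the function names, the per-layer pairing, and the cosine-based loss are all assumptions for exposition.

```python
# Hypothetical sketch of a text-guided hidden-state alignment objective.
# Each vision hidden state is pulled toward the hidden state of a retrieved
# text describing the same image, layer by layer (illustrative only; the
# abstract does not specify TGA's exact loss or layer selection).
import math

def cosine(u, v):
    # Cosine similarity between two vectors given as Python lists.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(vision_states, text_states):
    """Mean (1 - cosine similarity) between each layer's vision hidden
    state and the retrieved text's hidden state at that layer."""
    pairs = list(zip(vision_states, text_states))
    return sum(1.0 - cosine(v, t) for v, t in pairs) / len(pairs)

# Toy example: two layers, 3-dimensional hidden states.
vision = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss = alignment_loss(vision, text)
```

Under this sketch, the loss is zero when vision and text hidden states point in the same direction at every layer, and grows as the semantic shift between modalities grows, which is the shift the abstract identifies as the cause of safety-mechanism failure.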