Augmenting Large Language Models (LLMs) with image-understanding capabilities has led to a boom of high-performing Vision-Language Models (VLMs). While aligning LLMs with human values has received widespread attention, the safety of VLMs has not received the same scrutiny. In this paper, we explore the impact of jailbreaking on three state-of-the-art VLMs, each using a distinct modeling approach. By comparing each VLM to its respective LLM backbone, we find that each VLM is more susceptible to jailbreaking. We consider this an undesirable consequence of visual instruction tuning, which imposes a forgetting effect on the LLM's safety guardrails. We therefore provide recommendations for future work: adopting evaluation strategies that expose the weaknesses of a VLM, and taking safety measures into account during visual instruction tuning.