Various jailbreak attacks have been proposed to red-team Large Language Models (LLMs), revealing vulnerabilities in their safeguards. Moreover, some methods are not limited to the textual modality and extend jailbreak attacks to Multimodal Large Language Models (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates performance reproduction and fair comparison. In addition, comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs such as GPT-4V, is lacking. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. A deep analysis of the evaluation results reveals that (1) GPT-4 and GPT-4V demonstrate better robustness against jailbreak attacks than open-source LLMs and MLLMs; (2) Llama2 and Qwen-VL-Chat are more robust than other open-source models; and (3) the transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found at https://github.com/chenxshuo/RedTeamingGPT4V