Various jailbreak attacks have been proposed to red-team Large Language Models (LLMs), revealing vulnerabilities in their safeguards. Moreover, some methods are not limited to the textual modality and extend jailbreak attacks to Multimodal Large Language Models (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates performance reproduction and fair comparison. In addition, comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs such as GPT-4V, is lacking. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. A deep analysis of the evaluation results reveals that (1) GPT-4 and GPT-4V demonstrate better robustness against jailbreak attacks than open-source LLMs and MLLMs; (2) Llama2 and Qwen-VL-Chat are more robust than other open-source models; and (3) the transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found at https://github.com/chenxshuo/RedTeamingGPT4V