Existing work on jailbreaking Multimodal Large Language Models (MLLMs) has focused primarily on adversarial examples in model inputs, with less attention paid to vulnerabilities in model APIs. To fill this research gap, we make the following contributions: 1) We discover a system prompt leakage vulnerability in GPT-4V. Through carefully designed dialogue, we successfully extract the internal system prompts of GPT-4V. This finding indicates potentially exploitable security risks in MLLMs; 2) Based on the acquired system prompts, we propose a novel MLLM jailbreaking attack method termed SASP (Self-Adversarial Attack via System Prompt). By employing GPT-4 as a red-teaming tool against itself, we search for potential jailbreak prompts using the stolen system prompts. To improve performance further, we also apply human modifications guided by GPT-4's analysis, which raises the attack success rate to 98.7%; 3) We evaluate the effect of modifying system prompts to defend against jailbreaking attacks. Results show that appropriately designed system prompts can significantly reduce jailbreak success rates. Overall, our work provides new insights into enhancing MLLM security and demonstrates the important role of system prompts in jailbreaking: they can be leveraged to greatly increase jailbreak success rates, and they also hold potential for defending against jailbreaks.