Existing work on jailbreak Multimodal Large Language Models (MLLMs) has focused primarily on adversarial examples in model inputs, with less attention to vulnerabilities, especially in model API. To fill the research gap, we carry out the following work: 1) We discover a system prompt leakage vulnerability in GPT-4V. Through carefully designed dialogue, we successfully extract the internal system prompts of GPT-4V. This finding indicates potential exploitable security risks in MLLMs; 2) Based on the acquired system prompts, we propose a novel MLLM jailbreaking attack method termed SASP (Self-Adversarial Attack via System Prompt). By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. Furthermore, in pursuit of better performance, we also add human modification based on GPT-4's analysis, which further improves the attack success rate to 98.7\%; 3) We evaluated the effect of modifying system prompts to defend against jailbreaking attacks. Results show that appropriately designed system prompts can significantly reduce jailbreak success rates. Overall, our work provides new insights into enhancing MLLM security, demonstrating the important role of system prompts in jailbreaking. This finding could be leveraged to greatly facilitate jailbreak success rates while also holding the potential for defending against jailbreaks.
翻译:现有针对多模态大语言模型(MLLMs)的破解研究主要集中于模型输入中的对抗样本,而对模型API等安全漏洞的关注较少。为填补这一研究空白,我们开展了以下工作:1)发现GPT-4V存在系统提示泄露漏洞。通过精心设计的对话,我们成功提取了GPT-4V的内部系统提示。这一发现表明MLLMs中存在潜在的可利用安全风险;2)基于获取的系统提示,我们提出一种名为SASP(基于系统提示的自对抗攻击)的新型MLLM破解攻击方法。该方法以GPT-4作为针对自身的红队测试工具,旨在利用窃取的系统提示搜索潜在的破解提示。此外,为追求更优性能,我们在GPT-4分析基础上加入人工调整,将攻击成功率进一步提升至98.7%;3)评估了修改系统提示以防御破解攻击的效果。结果表明,合理设计的系统提示能显著降低破解成功率。总体而言,我们的工作为增强MLLM安全性提供了新思路,揭示了系统提示在破解攻击中的关键作用。这一发现既能大幅提升破解成功率,同时也具备抵御破解攻击的潜力。