GPT-4V has attracted considerable attention for its extraordinary capacity to integrate and process multimodal information. At the same time, its face-recognition capability raises new safety concerns about privacy leakage. Despite researchers' efforts in safety alignment through RLHF and preprocessing filters, vulnerabilities may still be exploitable. In this study, we introduce AutoJailbreak, an innovative automatic jailbreak technique inspired by prompt optimization. We leverage Large Language Models (LLMs) for red teaming to refine the jailbreak prompt, and employ weak-to-strong in-context learning prompts to boost efficiency. Furthermore, we present an effective search method that incorporates early stopping to minimize optimization time and token expenditure. Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success Rate (ASR) exceeding 95.3\%. This research sheds light on strengthening GPT-4V security and underscores the potential for LLMs to be exploited in compromising GPT-4V integrity.