Large Visual Language Models (VLMs) such as GPT-4 have achieved remarkable success in generating comprehensive and nuanced responses, surpassing the capabilities of large language models. However, the integration of visual inputs introduces new security concerns, as malicious attackers can exploit multiple modalities to achieve their objectives. This has drawn increasing attention to the vulnerability of VLMs to jailbreak attacks. Most existing research focuses on generating adversarial images or nonsensical image collections to compromise these models. However, the challenge of leveraging meaningful images to produce targeted textual content by exploiting the VLMs' logical comprehension of images remains unexplored. In this paper, we study the problem of logical jailbreak from meaningful images to text. To investigate this issue, we introduce a novel dataset designed to evaluate jailbreaks via flowchart images. Furthermore, we develop a framework for text-to-text jailbreak using VLMs. Finally, we conduct an extensive evaluation of the framework on GPT-4o and GPT-4-vision-preview, achieving jailbreak rates of 92.8% and 70.0%, respectively. Our research reveals significant vulnerabilities in current VLMs with respect to image-to-text jailbreak. These findings underscore the need for a deeper examination of the security flaws in VLMs before their practical deployment.