There has been increasing interest in aligning large language models (LLMs) with human values. However, the safety issues that arise when LLMs are integrated with a vision module, i.e., in vision language models (VLMs), remain relatively underexplored. In this paper, we propose a novel jailbreaking attack against VLMs that aims to bypass their safety barrier when a user inputs harmful instructions. We assume a scenario in which our poisoned (image, text) data pairs are included in the training data. By replacing the original textual captions with malicious jailbreak prompts, our method can perform jailbreak attacks with the poisoned images. Moreover, we analyze the effect of the poison ratio and the position of the trainable parameters on our attack's success rate. For evaluation, we design two metrics to quantify the success rate and the stealthiness of our attack. Together with a curated list of harmful instructions, these metrics form a benchmark for measuring attack efficacy. We demonstrate the efficacy of our attack by comparing it with baseline methods.