Multimodal large language models (MLLMs) have become the cornerstone of today's generative AI ecosystem, sparking intense competition among tech giants and startups. An MLLM generates a text response given a prompt consisting of an image and a question. While state-of-the-art MLLMs use safety filters and alignment techniques to refuse unsafe prompts, in this work we introduce MLLM-Refusal, the first method that induces refusals for safe prompts. Specifically, MLLM-Refusal optimizes a nearly-imperceptible refusal perturbation and adds it to an image, causing a target MLLM to likely refuse a safe prompt containing the perturbed image and a safe question. We formulate MLLM-Refusal as a constrained optimization problem and propose an algorithm to solve it. Our method offers competitive advantages to MLLM providers by potentially disrupting the user experience of competing MLLMs: a competing MLLM's users receive unexpected refusals when they unwittingly include these perturbed images in their prompts. We evaluate MLLM-Refusal on four MLLMs across four datasets, demonstrating its effectiveness at causing competing MLLMs to refuse safe prompts while not affecting non-competing MLLMs. Furthermore, we explore three potential countermeasures: adding Gaussian noise, DiffPure, and adversarial training. Our results show that they are insufficient: while they can mitigate MLLM-Refusal's effectiveness, they also sacrifice the accuracy and/or efficiency of the competing MLLM. The code is available at https://github.com/Sadcardation/MLLM-Refusal.
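The abstract describes optimizing a bounded refusal perturbation as a constrained optimization problem. As a rough illustration only (not the paper's actual algorithm), the following sketch shows the general shape of such a projected-gradient loop: a toy linear classifier stands in as a differentiable surrogate for the target MLLM, the loss is the cross-entropy of a hypothetical "refuse" output, and the perturbation is projected onto an $\ell_\infty$ ball of radius `eps` to keep it nearly imperceptible. All names (`REFUSE`, `optimize_refusal_perturbation`, the surrogate model `W`) are invented for this sketch.

```python
import numpy as np

# Hypothetical toy setup: a linear classifier stands in for the target MLLM
# so the projected-gradient loop below is runnable. In the real setting the
# gradient would come from backpropagating through the MLLM's refusal loss.
rng = np.random.default_rng(0)
dim, n_classes = 32, 4
REFUSE = 3                       # index of the hypothetical "refuse" output
W = rng.normal(size=(n_classes, dim))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def refusal_loss(x):
    """Cross-entropy of the refusal class; lower = more likely to refuse."""
    return -np.log(softmax(W @ x)[REFUSE] + 1e-12)

def loss_grad(x):
    # d(cross-entropy)/dx for the linear surrogate: W^T (p - onehot)
    p = softmax(W @ x)
    onehot = np.eye(n_classes)[REFUSE]
    return (p - onehot) @ W

def optimize_refusal_perturbation(image, eps=0.05, lr=0.01, steps=200):
    """PGD-style loop: minimize refusal loss s.t. ||delta||_inf <= eps."""
    delta = np.zeros_like(image)
    for _ in range(steps):
        delta -= lr * loss_grad(image + delta)
        delta = np.clip(delta, -eps, eps)                  # l_inf projection
        delta = np.clip(image + delta, 0.0, 1.0) - image   # valid pixel range
    return delta

image = rng.uniform(0.0, 1.0, size=dim)
delta = optimize_refusal_perturbation(image)
print(refusal_loss(image), refusal_loss(image + delta))
```

The two projection steps mirror the abstract's constraint: the perturbation stays small in $\ell_\infty$ norm, and the perturbed image remains a valid image.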