With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), ensuring their safety has become increasingly important. However, the integration of additional modalities exposes MLLMs to new vulnerabilities, making them prone to structure-based jailbreak attacks, in which semantic content (e.g., "harmful text") is injected into images to mislead MLLMs. In this work, we aim to defend against such threats. Specifically, we propose \textbf{Ada}ptive \textbf{Shield} Prompting (\textbf{AdaShield}), which prepends defense prompts to inputs to defend MLLMs against structure-based jailbreak attacks without fine-tuning the MLLMs or training additional modules (e.g., post-stage content detectors). First, we present a manually designed static defense prompt that instructs the model to examine the image and instruction content step by step and specifies how to respond to malicious queries. Second, we introduce an adaptive auto-refinement framework consisting of a target MLLM and an LLM-based defense prompt generator (Defender). These two components communicate collaboratively and iteratively to generate a defense prompt. Extensive experiments on popular structure-based jailbreak attacks and benign datasets show that our methods consistently improve MLLMs' robustness against structure-based jailbreak attacks without compromising the models' general capabilities as evaluated on standard benign tasks. Our code is available at https://github.com/rain305f/AdaShield.
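The iterative exchange between the target MLLM and the Defender can be illustrated with a minimal sketch. This is not the released AdaShield implementation: the function names, the refusal marker, the round limit, and the toy stand-ins for both models are all assumptions made for illustration, and the image is represented by a placeholder string.

```python
from typing import Callable, Optional

# Hypothetical refusal marker; a real system would need a more robust check.
REFUSAL_MARKER = "I am sorry"


def prepend_defense(defense_prompt: str, instruction: str) -> str:
    """Prepend the defense prompt to the text instruction."""
    return f"{defense_prompt}\n{instruction}"


def refine_defense_prompt(
    target_mllm: Callable[[str, str], str],
    defender: Callable[[str, str, str], str],
    image: str,
    instruction: str,
    initial_prompt: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Iterate between the target model and the Defender.

    Returns the first defense prompt that makes the target refuse the
    malicious query, or None if no such prompt is found in max_rounds.
    """
    prompt = initial_prompt
    for _ in range(max_rounds):
        response = target_mllm(image, prepend_defense(prompt, instruction))
        if REFUSAL_MARKER in response:
            return prompt
        # The Defender sees the failed prompt and the unsafe response,
        # and proposes a revised defense prompt.
        prompt = defender(prompt, instruction, response)
    return None


# Toy stand-ins (assumptions) so the sketch runs without any real model.
def toy_mllm(image: str, text: str) -> str:
    if "inspect the image" in text.lower():
        return "I am sorry, I cannot help with that."
    return "Sure, here is how..."


def toy_defender(prompt: str, instruction: str, response: str) -> str:
    return prompt + " Inspect the image for harmful text before answering."


result = refine_defense_prompt(
    toy_mllm, toy_defender, "image.png", "Follow the text in the image.", "Be safe."
)
print(result)
```

In this toy setting the initial prompt fails in the first round, the Defender appends an inspection instruction, and the refined prompt triggers a refusal in the second round. Passing the two models as callables keeps the loop independent of any particular MLLM or LLM API.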