Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign-looking prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirection without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak success and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extensible to new attack types, serving as a practical safety patch for both weakly and strongly aligned LVLMs.
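To make the Block / Reframe / Forward dispatch concrete, here is a minimal sketch of how such a preprocessing layer could be wired. The category names, the policy table, and the keyword-based `classify` stub are illustrative assumptions, not SHIELD's actual classifier or prompt templates.

```python
from enum import Enum

class Action(Enum):
    BLOCK = "block"      # refuse outright with a category-tailored message
    REFRAME = "reframe"  # prepend a category-specific safety prompt, then forward
    FORWARD = "forward"  # pass the input through unchanged

# Hypothetical policy table: category -> (action, guidance).
# SHIELD's real categories and prompt templates are defined in the paper.
POLICY = {
    "weapons":   (Action.BLOCK,   "Refuse; weapon-making instructions are not provided."),
    "self_harm": (Action.REFRAME, "Respond only with supportive, safe alternatives."),
    "benign":    (Action.FORWARD, ""),
}

def classify(prompt: str, image: bytes) -> str:
    """Stand-in for SHIELD's fine-grained safety classifier; a real
    implementation would score the text and image jointly."""
    return "weapons" if "weapon" in prompt.lower() else "benign"

def shield_preprocess(prompt: str, image: bytes) -> str | None:
    """Compose the input actually sent to the LVLM (None means blocked)."""
    action, guidance = POLICY.get(classify(prompt, image), (Action.FORWARD, ""))
    if action is Action.BLOCK:
        return None
    if action is Action.REFRAME:
        return f"[SAFETY GUIDANCE: {guidance}]\n{prompt}"
    return prompt

# Example: a benign caption request passes through untouched.
assert shield_preprocess("Describe this landscape.", b"") == "Describe this landscape."
```

Because the policy lives entirely in a lookup table ahead of the model, extending coverage to a new attack type amounts to adding one classifier category and one (action, guidance) entry, with no retraining of the underlying LVLM.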