Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction, ensuring the attack is both effective and imperceptible. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the highest attack success rates (ASRs) among the compared baselines in almost all settings without compromising clean accuracy, while remaining robust against existing defenses and transferring across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.