Multi-modal Large Language Models (MLLMs) have recently achieved strong performance across a range of vision-language tasks, including visual grounding. However, the adversarial robustness of visual grounding in MLLMs remains unexplored. To fill this gap, we take referring expression comprehension (REC) as an example visual grounding task and propose three adversarial attack paradigms. First, untargeted adversarial attacks induce MLLMs to generate an incorrect bounding box for each object. Second, exclusive targeted adversarial attacks force all generated outputs to the same target bounding box. Third, permuted targeted adversarial attacks permute the bounding boxes among different objects within a single image. Extensive experiments demonstrate that the proposed methods can successfully attack the visual grounding capabilities of MLLMs. Our methods not only offer a new perspective for designing novel attacks but also serve as a strong baseline for improving the adversarial robustness of visual grounding in MLLMs.
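To make the three objectives concrete, below is a minimal PGD-style sketch of how such attacks could be formulated, assuming the MLLM's REC output can be treated as a differentiable function from an image (plus a fixed referring expression) to per-object box coordinates. The `pgd_attack` function, the `ToyREC` stand-in model, the smooth-L1 loss, and the L-infinity budget are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, image, text, target_boxes, *, targeted,
               epsilon=8 / 255, alpha=1 / 255, steps=40):
    """Perturb `image` within an L-inf ball of radius `epsilon`.

    targeted=False : untargeted -- push predictions away from the
                     ground-truth boxes (`target_boxes` = ground truth).
    targeted=True  : targeted   -- pull predictions toward `target_boxes`,
                     which encodes either one shared box repeated for every
                     object (exclusive targeted) or a permutation of the
                     ground-truth boxes across objects (permuted targeted).
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        pred_boxes = model(image + delta, text)          # (num_objects, 4)
        loss = F.smooth_l1_loss(pred_boxes, target_boxes)
        grad, = torch.autograd.grad(loss, delta)
        # Untargeted ascends the loss; targeted descends it.
        step = alpha * grad.sign() * (-1.0 if targeted else 1.0)
        delta = (delta + step).clamp(-epsilon, epsilon).detach().requires_grad_(True)
    return (image + delta).clamp(0, 1).detach()

if __name__ == "__main__":
    # Toy stand-in for an MLLM REC head: a conv net regressing 3 boxes.
    class ToyREC(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Conv2d(3, 8, 3, stride=4), torch.nn.ReLU(),
                torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
                torch.nn.Linear(8, 12))
        def forward(self, img, text):
            return self.net(img.unsqueeze(0)).view(3, 4)

    model, image, gt_boxes = ToyREC(), torch.rand(3, 224, 224), torch.rand(3, 4)
    # Untargeted: drive predictions away from the ground truth.
    adv_untargeted = pgd_attack(model, image, "the red mug", gt_boxes, targeted=False)
    # Exclusive targeted: every object mapped to one shared box.
    exclusive_targets = gt_boxes[0].expand(3, 4)
    adv_exclusive = pgd_attack(model, image, "the red mug", exclusive_targets, targeted=True)
    # Permuted targeted: shuffle the ground-truth boxes across objects.
    permuted_targets = gt_boxes[torch.randperm(3)]
    adv_permuted = pgd_attack(model, image, "the red mug", permuted_targets, targeted=True)
```

Note that all three paradigms reduce to the same projected-gradient loop under this assumed interface; only the choice of `target_boxes` and the sign of the gradient step distinguish them.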