Machine learning models are known to be vulnerable to adversarial attacks, but traditional attacks have mostly focused on a single modality. With the rise of large multi-modal models (LMMs) such as CLIP, which combine vision and language capabilities, new vulnerabilities have emerged. However, prior work on multimodal targeted attacks aims to completely change the model's output to what the adversary wants. In many realistic scenarios, an adversary may seek to make only subtle modifications to the output, so that the changes go unnoticed by downstream models or even by humans. We introduce Hiding-in-Plain-Sight (HiPS) attacks, a novel class of adversarial attacks that subtly modify model predictions by selectively concealing the target object(s), as if the target object were absent from the scene. We propose two HiPS attack variants, HiPS-cls and HiPS-cap, and demonstrate their effectiveness in transferring to downstream image captioning models, such as CLIP-Cap, for targeted object removal from image captions.
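To make the idea concrete, the sketch below shows one plausible way an object-concealment perturbation against CLIP could be set up. It is not the HiPS-cls or HiPS-cap objective from this work; it is a minimal PGD-style illustration under assumed choices (the `openai/clip-vit-base-patch32` checkpoint, illustrative prompts, and an arbitrary perturbation budget): the perturbation is optimized to lower the image's similarity to a caption that names the target object while keeping its similarity to a caption that omits it.

```python
# Hypothetical sketch only: a PGD-style perturbation that "hides" a target object
# from CLIP by steering the image embedding away from an object-mentioning caption
# and toward an object-free caption. Model, prompts, and hyperparameters are
# illustrative assumptions, not the paper's method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")          # hypothetical input image
texts = ["a photo of a street with a dog",      # caption naming the target object
         "a photo of a street"]                 # caption with the object concealed
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)

pixel_values = inputs["pixel_values"]
delta = torch.zeros_like(pixel_values, requires_grad=True)  # adversarial perturbation
eps, alpha, steps = 8 / 255, 1 / 255, 40                    # assumed budget (normalized pixel space)

with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                        attention_mask=inputs["attention_mask"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

for _ in range(steps):
    img_emb = model.get_image_features(pixel_values=pixel_values + delta)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    sim = img_emb @ text_emb.T                  # similarities to [with-object, without-object]
    # Decrease similarity to the object-mentioning caption, increase it to the object-free one.
    loss = sim[0, 0] - sim[0, 1]
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()      # signed gradient descent step
        delta.clamp_(-eps, eps)                 # keep the perturbation small
        delta.grad.zero_()
```

The resulting perturbed image could then be passed to a CLIP-based captioner to check whether the target object no longer appears in the generated caption; the paper's HiPS-cls and HiPS-cap variants define their own losses for this goal.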