Multi-modal embeddings encode texts, images, thermal images, sounds, and videos into a single embedding space, aligning representations across different modalities (e.g., associating an image of a dog with the sound of barking). In this paper, we show that multi-modal embeddings can be vulnerable to an attack we call "adversarial illusions." Given an image or a sound, an adversary can perturb it so that its embedding is close to the embedding of an arbitrary, adversary-chosen input in another modality. These attacks are cross-modal and targeted: the adversary can align any image or any sound with any target of their choice. Adversarial illusions exploit proximity in the embedding space and are thus agnostic to downstream tasks and modalities, enabling a wholesale compromise of current and future tasks, as well as of modalities not available to the adversary. Using ImageBind and AudioCLIP embeddings, we demonstrate how adversarially aligned inputs, generated without knowledge of specific downstream tasks, mislead image generation, text generation, zero-shot classification, and audio retrieval. We investigate the transferability of illusions across different embeddings and develop a black-box version of our method, which we use to demonstrate the first adversarial alignment attack on Amazon's commercial, proprietary Titan embedding. Finally, we analyze countermeasures and evasion attacks.
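For concreteness, the alignment objective underlying these attacks can be sketched as a gradient-based optimization loop: perturb the input so that the cosine distance between its embedding and the embedding of the adversary-chosen target shrinks, while keeping the perturbation within a small budget. The sketch below is a minimal PyTorch illustration, assuming a differentiable image encoder `embed_image`; the perturbation budget `eps`, step count, and learning rate are illustrative placeholders, not the configuration evaluated in the paper.

```python
import torch
import torch.nn.functional as F

def adversarial_illusion(x, target_emb, embed_image, eps=8 / 255, steps=200, lr=1e-2):
    """Perturb input `x` so that its embedding moves toward `target_emb`,
    the embedding of an adversary-chosen input from another modality,
    under an L-infinity perturbation budget `eps` (hypothetical defaults)."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv_emb = embed_image((x + delta).clamp(0, 1))
        # Minimize cosine distance to the cross-modal target embedding.
        loss = 1 - F.cosine_similarity(adv_emb, target_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # project back into the eps-ball
    return (x + delta).detach().clamp(0, 1)
```

A sound input can be attacked in the same way by swapping `embed_image` for an audio encoder and choosing `target_emb` from, e.g., a text encoder; nothing in the loop depends on any downstream task, which is what makes the attack task-agnostic.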