Multi-modal embeddings encode texts, images, sounds, videos, etc., into a single embedding space, aligning representations across different modalities (e.g., associating an image of a dog with a barking sound). In this paper, we show that multi-modal embeddings can be vulnerable to an attack we call "adversarial illusions." Given an image or a sound, an adversary can perturb it to make its embedding close to an arbitrary, adversary-chosen input in another modality. These attacks are cross-modal and targeted: the adversary is free to align any image and any sound with any target of their choice. Adversarial illusions exploit proximity in the embedding space and are thus agnostic to downstream tasks and modalities. This enables a wholesale compromise of current and future downstream tasks, including tasks and modalities not available to the adversary. Using ImageBind and AudioCLIP embeddings, we demonstrate how adversarially aligned inputs, generated without knowledge of specific downstream tasks, mislead image generation, text generation, zero-shot classification, and audio retrieval. We investigate the transferability of illusions across different embeddings and develop a black-box version of our method, which we use to demonstrate the first adversarial alignment attack on Amazon's commercial, proprietary Titan embedding. Finally, we analyze countermeasures and evasion attacks.
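The core idea, perturbing an input so that its embedding moves toward an adversary-chosen target, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the encoder is a toy random linear map `W` rather than ImageBind or AudioCLIP, the target embedding is a random vector standing in for the embedding of an input from another modality, and the optimizer is a generic projected sign-gradient ascent on cosine similarity under an L-infinity bound `eps`. The function name `illusion` and all parameter values are hypothetical.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(normalize(a) @ normalize(b))

def illusion(x, target_emb, W, eps=0.05, step=0.005, iters=100):
    """Perturb x (values in [0, 1]) within an L-inf ball of radius eps
    so that the toy embedding W @ x aligns with target_emb."""
    t = normalize(target_emb)
    delta = np.zeros_like(x)
    for _ in range(iters):
        e = W @ (x + delta)
        n = np.linalg.norm(e)
        cos = (e @ t) / n
        # Gradient of cos(e, t) with respect to e, for unit-norm t.
        grad_e = (t - cos * e / n) / n
        # Chain rule through the linear toy encoder.
        grad_x = W.T @ grad_e
        # Sign-gradient ascent step, projected back onto the eps-ball.
        delta = np.clip(delta + step * np.sign(grad_x), -eps, eps)
        # Keep the perturbed input in the valid [0, 1] range.
        delta = np.clip(x + delta, 0.0, 1.0) - x
    return x + delta

# Toy setup (all hypothetical): 256-dim "input", 16-dim embedding space.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 256)) / np.sqrt(256)  # stand-in encoder
x = rng.uniform(0.0, 1.0, size=256)            # stand-in image or sound
target = rng.normal(size=16)                   # stand-in target embedding

adv = illusion(x, target, W)
print("cosine before:", cosine(W @ x, target))
print("cosine after: ", cosine(W @ adv, target))
print("max perturbation:", np.max(np.abs(adv - x)))
```

Because the objective depends only on proximity in the embedding space, nothing in this sketch refers to a downstream task, which mirrors the task-agnostic property described above. A real attack would replace `W` with the gradients of an actual multi-modal encoder.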