Multi-modal embeddings encode images, sounds, text, videos, etc. into a single embedding space, aligning representations across modalities (e.g., associating an image of a dog with a barking sound). We show that multi-modal embeddings can be vulnerable to an attack we call "adversarial illusions." Given an image or a sound, an adversary can perturb it so that its embedding is close to that of an arbitrary, adversary-chosen input in another modality. This enables the adversary to align any image and any sound with any text. Adversarial illusions exploit proximity in the embedding space and are thus agnostic to downstream tasks. Using ImageBind embeddings, we demonstrate how adversarially aligned inputs, generated without knowledge of specific downstream tasks, mislead image generation, text generation, and zero-shot classification.
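The core idea of perturbing an input until its embedding moves toward an adversary-chosen target can be illustrated with a small sketch. Everything here is an assumption made for illustration: the encoder is a toy linear map followed by L2 normalization (a hypothetical stand-in, not ImageBind), the target is a random unit vector standing in for a text embedding, and the optimizer is a generic iterative signed-gradient ascent on cosine similarity under an L-infinity budget, which is one common way to build such perturbations and not necessarily the exact procedure behind the results above. The names `embed`, `cosine_grad`, `adversarial_illusion`, `eps`, and `step` are all hypothetical.

```python
import numpy as np

def embed(W, x):
    """Toy encoder (hypothetical stand-in): linear map, then L2 normalization."""
    z = W @ x
    return z / np.linalg.norm(z)

def cosine_grad(W, x, target):
    """Analytic gradient of cos(embed(W, x), target) w.r.t. x (target is unit-norm)."""
    z = W @ x
    norm = np.linalg.norm(z)
    u = z / norm
    # d(u . t)/dz = (t - (u . t) u) / ||z||, then chain rule through z = W x
    return W.T @ (target - (u @ target) * u) / norm

def adversarial_illusion(W, x, target, eps=0.05, step=0.005, iters=200):
    """Signed-gradient ascent on cosine similarity, constrained to an L_inf ball."""
    x_adv = x.copy()
    for _ in range(iters):
        g = cosine_grad(W, x_adv, target)
        x_adv = x_adv + step * np.sign(g)          # move toward the target embedding
        x_adv = np.clip(x_adv, x - eps, x + eps)   # stay within eps of the original
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay in the valid input range
    return x_adv

rng = np.random.default_rng(0)
dim_in, dim_emb = 256, 16
W = rng.normal(size=(dim_emb, dim_in))         # toy "image" encoder weights
x = rng.uniform(0.0, 1.0, size=dim_in)         # toy "image" with values in [0, 1]
target = rng.normal(size=dim_emb)              # stand-in for a text embedding
target /= np.linalg.norm(target)

eps = 0.05
x_adv = adversarial_illusion(W, x, target, eps=eps)
before = float(embed(W, x) @ target)
after = float(embed(W, x_adv) @ target)
print(f"cosine similarity to target: before={before:.3f}, after={after:.3f}")
print(f"max per-coordinate change: {np.max(np.abs(x_adv - x)):.3f}")
```

Note that the loop never consults a downstream task: it only uses the encoder and the target embedding, which is why an attack of this shape is task-agnostic. A real attack would replace the analytic gradient with automatic differentiation through the actual encoder.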