Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike that of CLIP encoders, their vulnerability to adversarial attacks has not been explored. In this first work on the topic, we begin by formulating attacks that perturb the features extracted by discrete tokenizers and thereby change the resulting tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to mitigate this vulnerability, we take inspiration from recent work on robust CLIP encoders and fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. Although unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, it can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and takes an important step toward safe multimodal foundation models.
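To make the attack idea concrete, below is a minimal PGD-style sketch under assumed interfaces: `tokenizer_encoder` stands in for the tokenizer's pre-quantization feature extractor, and the budget, step size, and iteration count are illustrative defaults, not the paper's settings.

```python
import torch

def feature_attack(tokenizer_encoder, images, eps=8/255, alpha=2/255, steps=10):
    """L_inf PGD that pushes the tokenizer's continuous features away from
    their clean values; a large enough feature shift flips the quantized
    tokens. `tokenizer_encoder` (pre-quantization features) is an assumed
    interface for illustration."""
    clean_feats = tokenizer_encoder(images).detach()
    delta = torch.empty_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        feats = tokenizer_encoder((images + delta).clamp(0, 1))
        loss = (feats - clean_feats).pow(2).mean()   # distance to clean features
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()       # ascend: move features away
            delta.clamp_(-eps, eps)                  # project onto the eps-ball
            delta.grad = None
    return (images + delta).clamp(0, 1).detach()
```

Because a vector quantizer assigns each feature to its nearest codebook entry, pushing the continuous features far from their clean values suffices to change the discrete tokens, which is why such an attack can be application-agnostic.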
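A matching sketch of the defense, in the spirit of the unsupervised adversarial fine-tuning used for robust CLIP encoders that the abstract alludes to: the tokenizer encoder is trained so that its features on adversarial images match the clean features of a frozen copy of the original encoder. It reuses `feature_attack` from the sketch above; the names `encoder` and `frozen_encoder` and all hyperparameters are assumptions, not the paper's exact recipe.

```python
import torch

def adversarial_finetune_step(encoder, frozen_encoder, images, optimizer,
                              eps=8/255, alpha=2/255, steps=10):
    """One unsupervised adversarial training step: the inner maximization
    crafts adversarial images with the attack above; the outer minimization
    pulls the encoder's features on them back toward the frozen original's
    clean features. No labels are used anywhere."""
    with torch.no_grad():
        target = frozen_encoder(images)        # clean features of the frozen copy
    adv = feature_attack(encoder, images, eps, alpha, steps)
    optimizer.zero_grad()                      # clears grads left by the attack
    loss = (encoder(adv) - target).pow(2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice `frozen_encoder` would be a `copy.deepcopy` of the pre-trained encoder with gradients disabled. Since the objective needs no labels, any pool of unlabeled images can drive the fine-tuning, which is the versatility the abstract emphasizes over supervised adversarial training.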