Vision-language models (VLMs) are vulnerable to adversarial image perturbations. Existing defenses rely on adversarial training against task-specific adversarial examples, which is computationally expensive and often fails to generalize to unseen attack types. To address these limitations, we introduce Paraphrase-Decomposition-Aggregation (PDA), a training-free defense framework that leverages text augmentation to enhance VLM robustness under diverse adversarial image attacks. PDA performs prompt paraphrasing, question decomposition, and consistency aggregation entirely at test time, and thus requires no modification to the underlying models. To balance robustness and efficiency, we instantiate PDA in variants that reduce inference cost while retaining most of its robustness gains. Experiments on multiple VLM architectures and benchmarks for visual question answering, classification, and captioning show that PDA achieves consistent robustness gains against various adversarial perturbations while maintaining competitive clean accuracy, establishing a generic, strong, and practical inference-time defense framework for VLMs.
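The three test-time stages named in the abstract (paraphrasing, decomposition, aggregation) can be illustrated with a minimal sketch. The abstract does not specify how the stages are combined, so everything here is an assumption made for illustration: the callables `vlm`, `paraphrase`, and `decompose` are hypothetical interfaces, sub-question answers are passed back as textual context, and consistency aggregation is approximated by a simple majority vote over normalized answers. It is not the authors' implementation.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
#   vlm(image, prompt)   -> answer string
#   paraphrase(question) -> list of rewordings of the question
#   decompose(question)  -> list of simpler sub-questions
VLM = Callable[[object, str], str]


def normalize(answer: str) -> str:
    """Lowercase and trim so trivially different answers compare equal."""
    return answer.strip().lower().rstrip(".")


def pda_answer(image, question: str, vlm: VLM,
               paraphrase: Callable[[str], List[str]],
               decompose: Callable[[str], List[str]]) -> str:
    # 1. Paraphrase: query with the original prompt and its rewordings.
    prompts = [question] + paraphrase(question)
    candidates = [normalize(vlm(image, p)) for p in prompts]

    # 2. Decompose: answer simpler sub-questions, then re-ask the original
    #    question with those answers as context (an assumed design choice).
    facts = [f"Q: {q} A: {vlm(image, q)}" for q in decompose(question)]
    if facts:
        context = " ".join(facts)
        candidates.append(normalize(vlm(image, f"{context} {question}")))

    # 3. Aggregate: majority vote stands in for consistency aggregation.
    return Counter(candidates).most_common(1)[0][0]
```

Because the model is only queried, never updated, a wrapper of this shape leaves the underlying VLM unmodified. The number of paraphrases and sub-questions sets the inference cost, which is the trade-off the cheaper variants mentioned in the abstract are meant to address.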