XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs

Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
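
To make the sparse support concrete, the following is a minimal sketch of how a perturbation could be confined to two intersecting diagonals. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`x_mask`, `support_fraction`, `apply_perturbation`), the `thickness` parameter, the grayscale image in [0, 1], and the `eps` bound are all hypothetical. The abstract does not state the image resolution or line width behind the 1.76% figure, so the fraction computed here need not match it.

```python
def x_mask(height, width, thickness=1):
    """Return a 0/1 mask (list of rows) set to 1 on two intersecting diagonals.

    `thickness` is an assumed knob: the number of pixels each line covers per row.
    """
    mask = [[0] * width for _ in range(height)]
    for r in range(height):
        # Column where the main diagonal crosses row r (scaled for non-square images).
        c_main = round(r * (width - 1) / max(height - 1, 1))
        # The anti-diagonal is the horizontal mirror of the main diagonal.
        c_anti = (width - 1) - c_main
        for c0 in (c_main, c_anti):
            for dc in range(thickness):
                c = c0 + dc
                if 0 <= c < width:
                    mask[r][c] = 1
    return mask


def support_fraction(mask):
    """Fraction of pixels that the mask allows to be modified."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def apply_perturbation(image, delta, mask, eps):
    """Add `delta` only where mask == 1.

    `delta` is clipped to [-eps, eps] (an assumed magnitude bound) and the
    result is clipped to the valid pixel range [0, 1].
    """
    out = []
    for img_row, d_row, m_row in zip(image, delta, mask):
        out.append([
            min(1.0, max(0.0, p + m * max(-eps, min(eps, d))))
            for p, d, m in zip(img_row, d_row, m_row)
        ])
    return out


if __name__ == "__main__":
    mask = x_mask(224, 224, thickness=2)
    print(f"modifiable pixels: {support_fraction(mask):.2%}")
```

In an actual attack, `delta` would be updated by gradient steps on the joint objective described above, with the mask applied after every step so that pixels outside the X never change.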
