Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks cannot fully disclose the robustness vulnerabilities of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.
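As a purely illustrative aid, the two components named in the abstract can be sketched on toy per-token quantities. The abstract does not give CAGE's actual objective, so everything below is an assumption: top-k pruning by a per-token rank score stands in for the compressor, a uniform prior is placed over candidate budgets, distortion is measured as one minus cosine similarity, and alignment is measured with a Pearson correlation. All function names are hypothetical.

```python
import math


def cosine_distortion(clean, adv):
    """Distortion of one token: 1 - cosine similarity (assumed measure)."""
    dot = sum(c * a for c, a in zip(clean, adv))
    norm = math.sqrt(sum(c * c for c in clean)) * math.sqrt(sum(a * a for a in adv))
    return 1.0 - dot / norm


def survival_probabilities(scores, budgets):
    """Fraction of candidate budgets k under which each token is in the top-k.

    Assumes a top-k pruning compressor and a uniform prior over `budgets`.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rank = {token: r for r, token in enumerate(order)}  # 0 = highest score
    return [
        sum(1 for k in budgets if rank[i] < k) / len(budgets)
        for i in range(len(scores))
    ]


def expected_feature_disruption(distortions, scores, budgets):
    """Distortion weighted by each token's chance of surviving compression."""
    probs = survival_probabilities(scores, budgets)
    return sum(p * d for p, d in zip(probs, distortions))


def rank_distortion_alignment(distortions, scores):
    """Pearson correlation between token distortions and rank scores."""
    n = len(scores)
    mean_d = sum(distortions) / n
    mean_s = sum(scores) / n
    cov = sum((d - mean_d) * (s - mean_s) for d, s in zip(distortions, scores))
    var_d = sum((d - mean_d) ** 2 for d in distortions)
    var_s = sum((s - mean_s) ** 2 for s in scores)
    if var_d == 0 or var_s == 0:
        return 0.0  # correlation is undefined for constant inputs
    return cov / math.sqrt(var_d * var_s)


# Toy example: four tokens, budgets of 1 to 3 kept tokens.
scores = [0.9, 0.1, 0.5, 0.3]
budgets = [1, 2, 3]

# Distortion placed on the token that always survives...
on_survivor = expected_feature_disruption([1.0, 0.0, 0.0, 0.0], scores, budgets)
# ...versus on the token that is pruned under every budget.
on_pruned = expected_feature_disruption([0.0, 1.0, 0.0, 0.0], scores, budgets)
```

In this toy setting, the same amount of distortion contributes fully when it lands on the top-ranked token and not at all when it lands on a token that is pruned under every budget. This is one simple way to picture the optimization-inference mismatch the abstract describes. In an actual attack these quantities would need differentiable counterparts to be optimized with respect to the perturbation, which this sketch does not attempt.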