Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that robustness evaluations based on existing encoder-based attacks can substantially overestimate the robustness of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To close this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline attacks. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.
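The two components above can be illustrated with a toy objective. This is a minimal sketch, not the paper's actual formulation: the function names, the uniform expectation over a set of assumed budgets, the squared-L2 per-token distortion, and the softmax weighting used for rank-distortion alignment are all illustrative assumptions.

```python
import numpy as np

def survival_prob(scores, budgets):
    """p_i = fraction of assumed budgets under which token i ranks in the top-k.

    `scores` are per-token rank scores (higher = more likely to be kept);
    `budgets` is a hypothetical set of plausible token budgets.
    """
    order = np.argsort(-scores)            # token indices, best-ranked first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))  # rank position of each token
    return np.mean([ranks < k for k in budgets], axis=0)

def cage_objective(feat_clean, feat_adv, scores, budgets, lam=1.0):
    """Toy CAGE-style loss: expected feature disruption + rank-distortion alignment."""
    # Per-token squared-L2 distortion between clean and adversarial features.
    d = np.linalg.norm(feat_adv - feat_clean, axis=1) ** 2
    # (i) Expected feature disruption: weight distortion by survival probability,
    # concentrating it on tokens likely to survive across plausible budgets.
    efd = np.sum(survival_prob(scores, budgets) * d)
    # (ii) Rank-distortion alignment (illustrative): softmax-weight distortion by
    # rank score, so highly ranked (likely retained) tokens carry large distortion.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    rda = np.sum(w * d)
    return efd + lam * rda
```

In an attack loop, an adversary would maximize this objective with respect to the input perturbation (e.g., by gradient ascent under an L-infinity budget), so that the distortion it induces is aligned with what the compression bottleneck keeps rather than with tokens that get pruned away.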