The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move into regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on hallucination benchmarks they can reduce false positives by eliciting more conservative answers.
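The drift comparison above can be illustrated with a minimal sketch. The specific formulation here (cosine distance between clean and perturbed embeddings, normalized by the mean pairwise distance within a pool of unrelated images) is an assumption for illustration, not necessarily the paper's exact metric, and the random vectors stand in for real VLM vision-token features.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 minus cosine similarity between two embedding vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def drift_ratio(clean, perturbed, reference_pool):
    """Drift of a perturbed embedding, normalized by the mean pairwise
    distance between unrelated reference embeddings.

    A ratio near 1.0 means the perturbation moved the representation
    about as far as swapping in an unrelated image entirely.
    (Illustrative definition, not the paper's exact metric.)
    """
    drift = cosine_distance(clean, perturbed)
    inter_image = [cosine_distance(reference_pool[i], reference_pool[j])
                   for i in range(len(reference_pool))
                   for j in range(i + 1, len(reference_pool))]
    return drift / np.mean(inter_image)

rng = np.random.default_rng(0)
dim = 256
clean = rng.normal(size=dim)                      # stand-in clean embedding
perturbed = clean + 0.8 * rng.normal(size=dim)    # simulated perturbation shift
pool = rng.normal(size=(8, dim))                  # stand-ins for unrelated images

print(f"drift ratio: {drift_ratio(clean, perturbed, pool):.2f}")
```

A ratio computed this way makes the abstract's first finding concrete: an answer can stay fixed while the drift ratio approaches 1.0, i.e. the internal representation has moved as far as it would for a different image.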