Vision-Language Models (VLMs) inherit the adversarial vulnerabilities of Large Language Models (LLMs), and these vulnerabilities are further exacerbated by their multimodal nature. Existing defenses, including adversarial training, input transformations, and heuristic detection, are computationally expensive, architecture-dependent, and fragile against adaptive attacks. We introduce EigenShield, an inference-time defense that leverages Random Matrix Theory to quantify adversarial disruptions in high-dimensional VLM representations. Unlike prior methods that rely on empirical heuristics, EigenShield employs the spiked covariance model to detect structured spectral deviations. Using a Robustness-based Nonconformity Score (RbNS) and quantile-based thresholding, it separates causal eigenvectors, which encode semantic information, from correlational eigenvectors, which are susceptible to adversarial artifacts. By projecting embeddings onto the causal subspace, EigenShield filters adversarial noise without modifying model parameters or requiring adversarial training. This architecture-independent, attack-agnostic approach significantly reduces the attack success rate, establishing spectral analysis as a principled alternative to conventional defenses. Our results demonstrate that EigenShield consistently outperforms existing defenses, including adversarial training, UNIGUARD, and CIDER.
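The core filtering step described above (eigendecompose the embedding covariance, keep eigenvectors that rise above the noise bulk, and project embeddings onto that subspace) can be sketched as follows. This is a minimal illustration only: the abstract does not specify the RbNS computation, so this sketch substitutes a Marchenko-Pastur bulk-edge cutoff from the spiked covariance model as a stand-in for the quantile-based threshold, and the function name `causal_subspace_projection` is hypothetical, not from the paper.

```python
import numpy as np

def causal_subspace_projection(E):
    """Project embeddings onto the 'spiked' (signal-bearing) eigen-subspace.

    E: (n, d) matrix of n embedding vectors of dimension d.
    Returns an (n, d) matrix with components outside the retained
    subspace suppressed.

    NOTE: the paper's RbNS + quantile threshold is not specified in the
    abstract; we use the Marchenko-Pastur upper bulk edge as a simple
    stand-in criterion for separating spiked from bulk eigenvalues.
    """
    n, d = E.shape
    mu = E.mean(axis=0)
    X = E - mu                          # center the embeddings
    C = X.T @ X / n                     # sample covariance (d x d)
    evals, evecs = np.linalg.eigh(C)    # ascending eigenvalues

    # Crude noise-scale estimate from the bulk (assumption: most
    # eigenvalues belong to the noise bulk, so the median is a
    # reasonable proxy for the noise variance).
    sigma2 = np.median(evals)
    gamma = d / n
    bulk_edge = sigma2 * (1.0 + np.sqrt(gamma)) ** 2  # MP upper edge

    # Keep only eigenvectors whose eigenvalues escape the noise bulk;
    # under the spiked covariance model these carry the structured signal.
    keep = evals > bulk_edge
    V = evecs[:, keep]                  # (d, k) retained eigenvectors

    # Project onto the retained subspace and restore the mean.
    return X @ V @ V.T + mu
```

In use, one would fit the subspace on clean reference embeddings and apply the projection to incoming (possibly adversarial) embeddings at inference time; the sketch above folds both steps into one call for brevity. No model parameters are touched, matching the inference-time, architecture-independent framing of the abstract.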