Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies designed specifically to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust, model-agnostic detection framework that operates in both unimodal and multimodal settings. MirrorCheck uses Text-to-Image (T2I) models to regenerate visual content from the captions produced by the target model, then assesses semantic consistency by comparing the feature-space embeddings of the original and synthesized images. To strengthen robustness against adaptive attacks, MirrorCheck adds a stochastic defense strategy that randomly selects the T2I generator and image encoder from a diverse model zoo. We additionally introduce a novel One-Time-Use (OTU) perturbation, applied to the selected encoder's embeddings and regulated by a scaling factor, which reduces the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios show that MirrorCheck consistently outperforms baseline methods and maintains its utility even under strong adaptive adversarial conditions.
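The detection loop summarized in the abstract (caption the input, regenerate an image from that caption, compare embeddings, randomize the generator and encoder, perturb the embeddings) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the toy captioner, generators, and encoders stand in for real models; cosine similarity is assumed as the consistency measure; the OTU perturbation is assumed to be Gaussian noise drawn fresh for each query and added to both embeddings; and the threshold of 0.5 is hypothetical.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

Image = Sequence[float]  # toy stand-in for pixel data
Vector = List[float]

# Toy "images" that the stand-in T2I generators produce for each caption.
PROTOTYPES = {"cat": (1.0, 0.0, 0.0), "dog": (0.0, 1.0, 0.0)}


def toy_captioner(image: Image) -> str:
    """Stand-in for the target VLM; the third channel models an attack that flips the caption."""
    if image[2] > 0.05:
        return "dog"
    return "cat" if image[0] >= image[1] else "dog"


def toy_t2i_a(caption: str) -> Vector:
    return list(PROTOTYPES[caption])


def toy_t2i_b(caption: str) -> Vector:
    return [0.9 * v for v in PROTOTYPES[caption]]


def toy_encoder_a(image: Image) -> Vector:
    return list(image)


def toy_encoder_b(image: Image) -> Vector:
    return [2.0 * v for v in image]


GENERATORS = [toy_t2i_a, toy_t2i_b]
ENCODERS = [toy_encoder_a, toy_encoder_b]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def consistency_score(
    image: Image,
    captioner: Callable[[Image], str],
    generators: Sequence[Callable[[str], Vector]],
    encoders: Sequence[Callable[[Image], Vector]],
    scale: float = 0.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Similarity between the input image and an image regenerated from its caption."""
    rng = rng or random.Random()
    caption = captioner(image)
    # Stochastic defense: pick the generator and encoder at random for this query.
    generate = rng.choice(generators)
    encode = rng.choice(encoders)
    original = encode(image)
    regenerated = encode(generate(caption))
    # Assumed OTU form: noise drawn once per query, scaled by `scale`,
    # and added to both embeddings before comparison.
    noise = [rng.gauss(0.0, 1.0) for _ in original]
    original = [x + scale * n for x, n in zip(original, noise)]
    regenerated = [x + scale * n for x, n in zip(regenerated, noise)]
    return cosine_similarity(original, regenerated)


def is_adversarial(score: float, threshold: float = 0.5) -> bool:
    """Flag the input when regeneration disagrees with it (hypothetical threshold)."""
    return score < threshold
```

In this toy setup, a clean input yields a caption whose regenerated image matches the original, so the score stays near 1. An input that flips the caption yields a regenerated image far from the original, so the score drops and the input is flagged. Because the generator, encoder, and noise are re-drawn on each call, an attacker cannot optimize against a single fixed detection pipeline, which is the intuition behind the stochastic and OTU components.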