Deep neural networks are vulnerable to adversarial examples, i.e., carefully crafted input samples that cause models to make incorrect predictions with high confidence. To mitigate these vulnerabilities, adversarial training and detection-based defenses have been proposed to strengthen models in advance. However, most of these approaches focus on a single data modality, overlooking the relationship between the visual patterns of an input and its textual description. In this paper, we propose a novel defense, Multi-Shield, designed to complement existing defenses with multi-modal information and further enhance their robustness. Multi-Shield leverages multi-modal large language models to detect adversarial examples, abstaining from uncertain classifications when the textual and visual representations of the input do not align. Extensive evaluations on the CIFAR-10 and ImageNet datasets, using both robust and non-robust image classification models, demonstrate that Multi-Shield can be easily integrated to detect and reject adversarial examples, outperforming the original defenses.
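To make the abstention mechanism concrete, the following is a minimal sketch of such a text-image agreement check, assuming a CLIP-style multi-modal model loaded via Hugging Face `transformers`. The function name `multishield_predict`, the `base_classifier` callable, the CIFAR-10 label prompts, and the confidence threshold `tau` are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a cross-modal agreement check in the spirit of Multi-Shield.
# `base_classifier`, `LABELS`, and `tau` are illustrative placeholders.
import torch
from transformers import CLIPModel, CLIPProcessor

LABELS = ["airplane", "automobile", "bird", "cat", "deer",
          "dog", "frog", "horse", "ship", "truck"]  # CIFAR-10 classes

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def multishield_predict(image, base_classifier, tau=0.5):
    """Return the base classifier's label, or None (abstain) when the
    zero-shot text-image alignment disagrees with it."""
    # 1) Prediction of the (possibly robust) image classifier under defense;
    #    assumed to return an index into LABELS.
    base_label = base_classifier(image)

    # 2) Zero-shot scores: similarity between the image embedding and the
    #    text embedding of each class description.
    prompts = [f"a photo of a {c}" for c in LABELS]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)

    # 3) Abstain if the modalities disagree or the alignment is weak: a
    #    perturbation crafted against the image classifier alone is unlikely
    #    to also fool the joint text-image embedding space.
    if probs.argmax().item() != base_label or probs[base_label] < tau:
        return None  # reject / abstain from an uncertain classification
    return base_label
```

The design choice illustrated here is that rejection requires agreement across two independent views of the input, so an attacker must simultaneously defeat the classifier and the multi-modal alignment check.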