Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet they remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose Feature-space Smoothing (FS) and theoretically prove that FS provides certified robustness for the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the cosine similarity between clean and adversarial feature representations under $\ell_2$-bounded attacks. Moreover, we show that this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the Gaussian robustness score defined on the vanilla encoder. Building on this insight, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that raises the Gaussian robustness score of MLLMs and thereby strengthens their certified robustness under FS, without requiring any retraining of the MLLMs. We demonstrate that FS with PSM not only provides a strong theoretical robustness guarantee but also achieves superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks demonstrate the effectiveness of FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90\% to about 1\%.
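The smoothing construction can be illustrated with a minimal sketch. Note this is an assumed randomized-smoothing-style realization, not the paper's exact method: the names `smoothed_encoder`, `encode`, and the toy linear encoder are hypothetical stand-ins, and the noise level `sigma` and sample count are arbitrary choices. The sketch averages normalized features of Gaussian-perturbed copies of the input and then compares the smoothed clean and adversarial features by cosine similarity, the quantity that FS certifies a lower bound on.

```python
import numpy as np

def smoothed_encoder(encode, x, sigma=0.25, n_samples=100, seed=0):
    """Monte Carlo estimate of a Gaussian-smoothed feature encoder:
    average the L2-normalized features of noisy copies of x.
    (Illustrative sketch; sigma and n_samples are arbitrary.)"""
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        f = encode(noisy)
        feats.append(f / np.linalg.norm(f))  # normalize before averaging
    return np.mean(feats, axis=0)

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy linear "encoder" standing in for an MLLM feature extractor.
W = np.random.default_rng(1).normal(size=(8, 16))
encode = lambda x: W @ x

x = np.ones(16)
# Small l2-bounded perturbation standing in for an adversarial example.
delta = 0.1 * np.sign(np.random.default_rng(2).normal(size=16))
x_adv = x + delta

f_clean = smoothed_encoder(encode, x, seed=3)
f_adv = smoothed_encoder(encode, x_adv, seed=3)
print("smoothed clean-vs-adv cosine:", cosine(f_clean, f_adv))
```

In the certified setting, the Monte Carlo average is replaced by the expectation over the Gaussian, which is what makes the lower bound on the clean-versus-adversarial cosine similarity provable rather than empirical.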