Multi-modal models have shown a promising ability to integrate information from multiple sources, yet they remain vulnerable to pervasive perturbations such as uni-modal attacks and missing conditions. To counter these perturbations, robust multi-modal representations, positioned well away from the discriminative multi-modal decision boundary, are highly desirable. In this paper, unlike conventional empirical studies, we focus on a commonly used joint multi-modal framework and show theoretically that larger uni-modal representation margins and more reliable integration of modalities are essential components for achieving higher robustness. This finding further explains the limits of multi-modal robustness and why multi-modal models are often vulnerable to attacks on a specific modality. Moreover, our analysis reveals how a widespread issue, namely that the model has different preferences for different modalities, limits multi-modal robustness by affecting these essential components and can make attacks on a specific modality highly effective. Inspired by our theoretical findings, we introduce a training procedure called Certifiable Robust Multi-modal Training (CRMT), which alleviates the influence of modality preference and explicitly regulates the essential components to significantly improve robustness in a certifiable manner. Our method demonstrates substantial improvements in performance and robustness compared with existing methods. Furthermore, our training procedure can be easily extended to enhance other robust training strategies, highlighting its credibility and flexibility.