Multimodal Large Language Models (MLLMs) demonstrate strong performance on multimodal benchmarks, yet often exhibit poor robustness under spurious modality interference, such as irrelevant text in vision understanding or irrelevant visual content in question answering. At its core, modality interference refers to cases where spurious signals from non-essential modalities distort model decisions, a phenomenon we systematically analyze through causal, perturbation-based diagnostic experiments. To address this problem, we propose a unified fine-tuning framework that combines heuristic and adversarial perturbation-based data augmentation with output-level consistency regularization between original and perturbed inputs. Extensive experiments across image-heavy, text-heavy, and multimodal benchmarks, spanning multiple MLLM architectures and model scales, demonstrate consistent gains in unimodal robustness and generalization while also improving standard multimodal performance.
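The output-level consistency regularization described above can be illustrated with a minimal sketch. The snippet below implements one common choice of consistency regularizer, the symmetric KL divergence between the model's output distributions on an original input and its perturbed counterpart; the exact formulation used in the paper may differ, and the function names here are illustrative, not taken from the source.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q); terms with p_i == 0 contribute zero.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(logits_orig, logits_pert):
    # Symmetric KL between the output distributions on the
    # original and perturbed inputs. Minimizing this term
    # penalizes predictions that shift when spurious content
    # from a non-essential modality is perturbed.
    p = softmax(logits_orig)
    q = softmax(logits_pert)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

In training, this term would be added (with a weighting coefficient) to the standard task loss, so the model is rewarded both for answering correctly and for answering consistently across heuristic or adversarial perturbations of the irrelevant modality.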