Vision and language (VL) models are known to exploit non-robust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on relevant information in each modality. If a unimodal model achieves accuracy on a VL task similar to that of a multimodal one, this indicates that so-called unimodal collapse has occurred. However, accuracy-based tests fail to detect cases where, for example, the model's prediction is wrong although the model used relevant information from a modality. Instead, we propose MM-SHAP, a performance-agnostic multimodality score based on Shapley values that reliably quantifies the proportions in which a multimodal model uses individual modalities. We apply MM-SHAP in two ways: (1) to compare models by their average degree of multimodality, and (2) to measure, for individual models, the contribution of each modality across different tasks and datasets. Experiments with six VL models (LXMERT, CLIP and four ALBEF variants) on four VL tasks show that unimodal collapse can occur to different degrees and in different directions, contradicting the widespread assumption that unimodal collapse is one-sided. Based on our results, we recommend MM-SHAP for analysing multimodal tasks, to diagnose and guide progress towards multimodal integration. Code is available at \url{https://github.com/Heidelberg-NLP/MM-SHAP}.
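To illustrate what a performance-agnostic, proportion-based score could look like, the following is a minimal sketch rather than the released implementation. The abstract does not state the formula, so the sketch assumes that per-token Shapley values are already available for one prediction, that their magnitudes are summed per modality, and that the two sums are normalised to add up to 1. The function name `modality_shares` and the input values are hypothetical.

```python
from typing import Sequence, Tuple


def modality_shares(text_values: Sequence[float],
                    image_values: Sequence[float]) -> Tuple[float, float]:
    """Return (text_share, image_share), which sum to 1.

    Assumption: each input holds per-token Shapley values for a single
    prediction. Magnitudes are used so that positive and negative
    contributions do not cancel; the label and correctness of the
    prediction play no role, which keeps the score performance-agnostic.
    """
    text_total = sum(abs(v) for v in text_values)
    image_total = sum(abs(v) for v in image_values)
    total = text_total + image_total
    if total == 0.0:
        raise ValueError("all contributions are zero; shares are undefined")
    return text_total / total, image_total / total


# Made-up values for illustration only, not results from any model.
text_share, image_share = modality_shares([0.5, -0.25], [0.25])
print(text_share, image_share)  # 0.75 0.25
```

Under these assumptions, shares near (0.5, 0.5) would suggest balanced use of both modalities, while shares near (1, 0) or (0, 1) would point to collapse towards text or towards the image, respectively.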