Current mainstream approaches to multimodal imbalance focus primarily on architectural modifications and optimization-based strategies, often overlooking a quantitative analysis of the degree of imbalance between modalities. To address this gap, we introduce a method for quantitatively analyzing multimodal imbalance, which in turn informs the design of a sample-level adaptive loss function. We begin by defining the "Modality Gap" as the difference between the Softmax scores that different modalities (e.g., audio and visual) assign to the ground-truth class. Analysis of the Modality Gap distribution reveals that it can be effectively modeled by a bimodal Gaussian Mixture Model (GMM), whose two components correspond to "modality-balanced" and "modality-imbalanced" samples, respectively. We then apply Bayes' theorem to compute the posterior probability that each sample belongs to each of these two components. Informed by this quantitative analysis, we design an adaptive loss function with three objectives: (1) to minimize the overall Modality Gap; (2) to encourage the imbalanced sample distribution to shift toward the balanced one; and (3) to apply greater penalty weights to imbalanced samples. We employ a two-stage training strategy consisting of a warm-up phase followed by an adaptive training phase. Experimental results demonstrate that our approach achieves state-of-the-art (SOTA) performance on the public CREMA-D and AVE datasets, attaining accuracies of $80.65\%$ and $70.90\%$, respectively, which validates the effectiveness of the proposed method.
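The analysis pipeline described above (Modality Gap, two-component GMM, Bayesian posterior) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sign convention (audio minus visual), the plain EM fit, the synthetic gap values, and the rule that the component with mean closest to zero is the "balanced" one are all assumptions made for illustration only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modality_gap(audio_logits, visual_logits, label):
    """Gap between ground-truth-class softmax scores.

    Assumed sign convention: audio score minus visual score.
    """
    return softmax(audio_logits)[label] - softmax(visual_logits)[label]


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posteriors(x, weights, means, stds):
    """Bayes' theorem: P(component k | x) for each mixture component."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint) or 1e-300  # guard against underflow to zero
    return [j / total for j in joint]


def fit_gmm_1d(data, n_iter=100):
    """Plain EM for a two-component 1-D Gaussian mixture (illustrative)."""
    lo, hi = min(data), max(data)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    spread = (hi - lo) / 4.0 or 1.0
    stds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = [posteriors(x, weights, means, stds) for x in data]
        # M-step: re-estimate weight, mean, and std of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            stds[k] = max(math.sqrt(var), 1e-3)  # floor avoids collapse
    return weights, means, stds


# Synthetic gaps (hypothetical): a tight cluster near 0 and a wider one near 0.6.
rng = random.Random(0)
gaps = [rng.gauss(0.0, 0.05) for _ in range(300)]
gaps += [rng.gauss(0.6, 0.1) for _ in range(200)]

weights, means, stds = fit_gmm_1d(gaps)
# Assumption: the component whose mean is closest to zero is "balanced".
balanced = min(range(2), key=lambda k: abs(means[k]))
imbalanced = 1 - balanced

for g in (0.0, 0.7):
    p = posteriors(g, weights, means, stds)
    print(f"gap={g:+.2f}  P(balanced)={p[balanced]:.3f}  "
          f"P(imbalanced)={p[imbalanced]:.3f}")
```

In a setup like this, the posterior probability of the imbalanced component could serve as a per-sample signal for the adaptive loss. The exact form of that loss is not specified in the abstract, so it is not sketched here.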