Multimodal learning integrates diverse modalities but suffers from modality imbalance, where dominant modalities suppress weaker ones due to inconsistent convergence rates. Existing methods predominantly rely on static modulation or heuristics, overlooking sample-level distributional variations in prediction bias. In particular, they fail to distinguish outlier samples whose modality gap is exacerbated by low data quality. We propose a framework to quantitatively diagnose and dynamically mitigate this imbalance at the sample level. We introduce the Modality Gap metric to quantify prediction discrepancies between modalities. Analysis reveals that this gap follows a bimodal distribution, indicating the coexistence of balanced and imbalanced sample subgroups. We employ a Gaussian Mixture Model (GMM) to explicitly model this distribution, leveraging Bayesian posterior probabilities for soft subgroup separation. Our two-stage framework comprises a Warm-up stage and an Adaptive Training stage. In the latter, a GMM-guided Adaptive Loss dynamically reallocates optimization priorities: it imposes stronger alignment penalties on imbalanced samples to rectify bias, while prioritizing fusion for balanced samples to maximize complementary information. Experiments on CREMA-D, AVE, and Kinetics-Sounds demonstrate that our method significantly outperforms state-of-the-art baselines. Furthermore, we show that fine-tuning on a GMM-filtered balanced subset serves as an effective data purification strategy, yielding substantial gains by eliminating extreme noisy samples even without the adaptive loss.
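The GMM-based soft separation described above can be illustrated with a minimal sketch. Assuming per-sample modality-gap values drawn from a bimodal distribution, a two-component Gaussian mixture yields a posterior probability of belonging to the imbalanced subgroup, which can then weight a sample-level loss. All names (`gaps`, `p_imbalanced`, the stand-in loss terms) are illustrative, not the paper's implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical per-sample modality gaps: a bimodal mixture of a
# "balanced" subgroup (small gap) and an "imbalanced" one (large gap).
gaps = np.concatenate([
    rng.normal(0.1, 0.05, 700),   # balanced samples
    rng.normal(0.6, 0.10, 300),   # imbalanced samples
]).reshape(-1, 1)

# Fit a 2-component GMM to the empirical gap distribution.
gmm = GaussianMixture(n_components=2, random_state=0).fit(gaps)

# Bayesian posterior P(imbalanced | gap) gives a soft subgroup split;
# the component with the larger mean gap is the imbalanced one.
imb_comp = int(np.argmax(gmm.means_))
p_imbalanced = gmm.predict_proba(gaps)[:, imb_comp]

# Sketch of a GMM-guided adaptive loss: the posterior weights an
# alignment penalty, its complement weights a fusion objective.
align_loss = gaps.ravel() ** 2            # stand-in alignment penalty
fusion_loss = np.ones_like(align_loss)    # stand-in fusion loss
adaptive_loss = (p_imbalanced * align_loss
                 + (1.0 - p_imbalanced) * fusion_loss).mean()
```

Thresholding `p_imbalanced` (e.g. keeping samples with low posterior) also gives the GMM-filtered balanced subset used for the data-purification fine-tuning.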