Learning from multiple modalities, such as audio and video, makes it possible to exploit complementary information, improve robustness, and strengthen contextual understanding and performance. However, combining modalities is challenging, especially when they differ in data structure, predictive contribution, and the complexity of their learning processes. One modality can dominate the learning process, hindering the effective use of information from the other modalities and leading to sub-optimal model performance. To address this issue, the vast majority of previous works propose assessing the unimodal contributions and dynamically adjusting training to equalize them. We improve upon previous work by introducing a multi-loss objective and further refining the balancing process, so that it can dynamically adjust the learning pace of each modality in both directions (acceleration and deceleration) and phase out the balancing effects upon convergence. We achieve superior results across three audio-video datasets: on CREMA-D, models with ResNet backbone encoders surpass the previous best by 1.9% to 12.4%, and models with Conformer backbones improve by 2.8% to 14.1% across different fusion methods. On AVE, improvements range from 2.7% to 7.7%, while on UCF101, gains reach up to 6.1%.
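As a rough illustration of the idea of bidirectional balancing on top of a multi-loss objective, the following minimal sketch shows one way such coefficients could behave. It is not the formulation used in this work: the function names, the `tanh` shape, the `alpha` sensitivity parameter, and the use of a scalar per-modality performance score are all assumptions made for the example.

```python
import math


def balancing_coefficients(score_audio, score_video, alpha=1.0):
    """Return hypothetical (coef_audio, coef_video) balancing coefficients.

    Assumption: each score is a scalar estimate of how well the unimodal
    branch currently performs (for example, mean confidence on the true class).
    The lagging modality gets a coefficient above 1 (acceleration), the
    dominant one gets a coefficient below 1 (deceleration), and both
    coefficients return to 1 once the scores are equal, so the balancing
    effect fades out.
    """
    gap = score_video - score_audio
    coef_audio = 1.0 + math.tanh(alpha * gap)
    coef_video = 1.0 + math.tanh(-alpha * gap)
    return coef_audio, coef_video


def multi_loss(loss_fused, loss_audio, loss_video, coef_audio, coef_video):
    """Combine a fused (multimodal) loss with weighted unimodal losses.

    Assumption: the balancing coefficients are applied as weights on the
    unimodal loss terms; other designs (such as scaling encoder gradients
    directly) are equally plausible.
    """
    return loss_fused + coef_audio * loss_audio + coef_video * loss_video


# Audio lags behind video, so audio is accelerated and video is decelerated.
coef_a, coef_v = balancing_coefficients(score_audio=0.2, score_video=0.8)
total = multi_loss(1.0, 2.0, 3.0, coef_a, coef_v)
```

In this sketch the coefficients stay within the open interval (0, 2), so neither modality is ever switched off entirely, and a larger `alpha` makes the adjustment react more sharply to a given performance gap.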