Audio-visual learning supports a more comprehensive understanding of the world by fusing useful information from multiple modalities. However, recent studies show that imbalanced optimization of the uni-modal encoders in a joint-learning model is a bottleneck for model performance. We further find that recent imbalance-mitigating methods fail on some fine-grained audio-visual tasks, which place a higher demand on distinguishable feature distributions. Motivated by the success of cosine loss, which builds hyperspherical feature spaces and achieves lower intra-class angular variability, this paper proposes the Multi-Modal Cosine loss, MMCosine. It applies modality-wise $L_2$ normalization to features and weights, aiming at balanced and better multi-modal fine-grained learning. We show that our method alleviates the imbalanced optimization from the perspective of weight norm and fully exploits the discriminability of the cosine metric. Extensive experiments demonstrate the effectiveness of our method and its versatility with advanced multi-modal fusion strategies and recent imbalance-mitigating methods.
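To make the modality-wise $L_2$ normalization concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a late-fusion classifier whose weight matrix splits into an audio block and a visual block, and it assumes the fused logit is a scaled sum of the per-modality cosine similarities. The names `l2_normalize`, `mmcosine_logits`, and `cross_entropy`, the `scale` value, and all tensor shapes are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis, eps=1e-12):
    """Divide x by its L2 norm along `axis` (eps guards against zero vectors)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def mmcosine_logits(feat_a, feat_v, w_a, w_v, scale=10.0):
    """Assumed form: logits = scale * (cos(theta_audio) + cos(theta_visual)).

    feat_a: (batch, dim_a) audio features, w_a: (num_classes, dim_a)
    feat_v: (batch, dim_v) visual features, w_v: (num_classes, dim_v)
    `scale` is an assumed hyperparameter, not a value taken from the paper.
    """
    # Normalize features and class weights separately for each modality,
    # so each modality contributes a cosine similarity in [-1, 1].
    cos_a = l2_normalize(feat_a, axis=1) @ l2_normalize(w_a, axis=1).T
    cos_v = l2_normalize(feat_v, axis=1) @ l2_normalize(w_v, axis=1).T
    return scale * (cos_a + cos_v), cos_a, cos_v


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed in a numerically stable way."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


# Toy data: the audio branch is given a much larger norm on purpose.
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(4, 8)) * 50.0
feat_v = rng.normal(size=(4, 16))
w_a = rng.normal(size=(3, 8)) * 5.0
w_v = rng.normal(size=(3, 16))
labels = np.array([0, 1, 2, 0])

logits, cos_a, cos_v = mmcosine_logits(feat_a, feat_v, w_a, w_v)
loss = cross_entropy(logits, labels)
```

Under these assumptions, rescaling one modality's features or weights leaves the logits unchanged, so neither branch can dominate the fused prediction simply by growing its norm. This is one way to read the abstract's "perspective of weight norm".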