Modality discrepancies have long posed significant challenges in Automated Audio Captioning (AAC) and across multi-modal domains more broadly. Enabling models to comprehend textual information plays a pivotal role in establishing a seamless connection between the text and audio modalities. While recent research has focused on closing the gap between these two modalities through contrastive learning, a simple contrastive loss alone is insufficient to bridge their differences. This paper introduces Enhance Depth of Text Comprehension (EDTC), which enhances the model's understanding of textual information from three different perspectives. First, we propose a novel fusion module, FUSER, which extracts shared semantic information from different audio features through feature fusion. We then introduce TRANSLATOR, a novel alignment module that aligns audio features and text features at the tensor level. Finally, the weights of the twin structure are updated with momentum so that the model can learn information from both modalities simultaneously. The resulting method achieves state-of-the-art performance on the AudioCaps dataset and results comparable to the state of the art on the Clotho dataset.
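The momentum-based weight update for the twin structure can be sketched as an exponential moving average, in the style popularized by momentum encoders. This is a minimal illustrative sketch only; the function name, parameter names, and the coefficient value are assumptions, not the paper's actual implementation.

```python
def momentum_update(online_params, momentum_params, m=0.999):
    """Illustrative EMA update for a twin/momentum branch (assumed form):
    theta_momentum <- m * theta_momentum + (1 - m) * theta_online.
    A larger m makes the momentum branch evolve more slowly and smoothly."""
    return [m * p_mom + (1.0 - m) * p_on
            for p_mom, p_on in zip(momentum_params, online_params)]


# Usage: with m=0.9, a momentum weight of 0.0 and an online weight of 1.0
# move to 0.9 * 0.0 + 0.1 * 1.0 = 0.1.
updated = momentum_update([1.0], [0.0], m=0.9)
```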