Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) or a relevant audio clip given a caption (T2A), has recently attracted considerable research attention. Existing methods typically aggregate the information from each modality into a single vector for matching, which sacrifices local details and makes it hard to capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels fail to measure fine-grained semantic differences between samples. To address these challenges, we present a novel ATR framework that captures the matching relationships of multimodal information from different perspectives and at finer granularities. Specifically, we introduce a fine-grained alignment method that performs detail-oriented matching through a multiscale process, from local to global levels, to capture subtle cross-modal relationships. In addition, we pioneer the application of cross-modal similarity consistency, leveraging intra-modal similarity relationships as soft supervision to encourage finer-grained alignment. Extensive experiments validate the effectiveness of our approach, which outperforms previous methods by at least 3.9% (T2A) / 6.9% (A2T) R@1 on the AudioCaps dataset and 2.9% (T2A) / 5.4% (A2T) R@1 on the Clotho dataset.
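The abstract does not give the exact form of the cross-modal similarity consistency objective, so the following is only a minimal sketch of one possible reading: intra-modal similarity distributions (audio-audio and text-text) serve as soft targets for the audio-to-text similarity distribution, in place of hard binary labels. The function name `similarity_consistency_loss`, the temperature `tau`, and the choice to average the two intra-modal distributions are illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def softmax(z, axis=1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def similarity_consistency_loss(audio_emb, text_emb, tau=0.07):
    """Cross-entropy between soft targets and the audio-to-text distribution.

    Assumption (hypothetical, not from the paper): the soft target for each
    audio clip is the average of its audio-audio and text-text similarity
    distributions within the batch.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)

    cross = a @ t.T / tau      # audio-to-text similarity logits
    intra_a = a @ a.T / tau    # audio-audio similarity logits
    intra_t = t @ t.T / tau    # text-text similarity logits

    # Soft supervision built from intra-modal similarity relationships.
    target = 0.5 * (softmax(intra_a) + softmax(intra_t))
    log_pred = np.log(softmax(cross) + 1e-12)

    # Mean cross-entropy over the batch.
    return float(-(target * log_pred).sum(axis=1).mean())


# Usage: a batch of 4 paired audio/text embeddings of dimension 32.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 32))
text = audio + 0.1 * rng.normal(size=(4, 32))  # roughly aligned pairs
print(similarity_consistency_loss(audio, text))
```

In this sketch a pair of captions that are semantically close to each other pulls the target mass toward both of their audio clips, which is one way soft labels can express graded similarity that binary contrastive labels cannot.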