In text-audio retrieval (TAR), text and audio are heterogeneous in content, so the semantic information in a text typically matches only certain frames of the audio. Yet existing works aggregate the entire audio without considering the text, for example by mean-pooling over the frames, which is likely to encode misleading audio information not described in the given text. In this paper, we present a text-aware attention pooling (TAP) module for TAR, which is essentially a scaled dot-product attention that lets a text attend to its most semantically similar frames. Furthermore, previous methods apply the softmax only in each single retrieval direction, ignoring potential cross-retrieval information. By exploring the intrinsic prior of each text-audio pair, we introduce a prior matrix revised (PMR) loss to filter out hard cases with high (or low) text-to-audio but low (or high) audio-to-text similarity scores, thus achieving the dual optimal match. Experiments show that TAP significantly outperforms various text-agnostic pooling functions. Moreover, the PMR loss yields stable performance gains on multiple datasets.
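To make the contrast between text-aware and text-agnostic pooling concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the text embedding serves directly as the attention query and the audio frame embeddings serve as both keys and values; an actual module would likely add learned projections, which the abstract does not specify. The function names `text_aware_pool` and `mean_pool` and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_aware_pool(text_vec, frames):
    """Pool audio frames with the text embedding as the attention query.

    Assumption: query, keys, and values are the raw embeddings; learned
    projection matrices are omitted to keep the sketch short.
    """
    dim = len(text_vec)
    # Scaled dot product between the text and every audio frame.
    scores = [
        sum(t * f for t, f in zip(text_vec, frame)) / math.sqrt(dim)
        for frame in frames
    ]
    weights = softmax(scores)
    # Weighted sum of frames: frames similar to the text dominate.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


def mean_pool(frames):
    """Text-agnostic baseline: every frame contributes equally."""
    n = len(frames)
    return [sum(frame[d] for frame in frames) / n for d in range(len(frames[0]))]


# Toy example (made-up values): only frame 1 points in the text's direction.
text = [4.0, 0.0]
audio_frames = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

pooled, weights = text_aware_pool(text, audio_frames)
baseline = mean_pool(audio_frames)
```

In this toy case the attention weights concentrate on the one frame aligned with the text, so the pooled vector stays close to that frame, whereas mean pooling is dominated by the two unrelated frames. This mirrors the motivation stated above, but it says nothing about the reported experimental gains.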