Most existing audio-text retrieval (ATR) methods construct contrastive pairs between whole audio clips and complete caption sentences, ignoring fine-grained cross-modal relationships such as those between short segments and phrases or between frames and words. In this paper, we introduce a hierarchical cross-modal interaction (HCI) method for ATR that simultaneously explores clip-sentence, segment-phrase, and frame-word relationships, achieving a comprehensive multi-modal semantic comparison. We also present a novel ATR framework that leverages auxiliary captions (AC) generated by a pretrained captioner: feature interaction between the audio and its generated captions yields enhanced audio representations and is complementary to the original ATR matching branch. The audio and generated captions also form new audio-text pairs that serve as data augmentation for training. Experiments show that HCI significantly improves ATR performance, and that the AC framework yields stable performance gains on multiple datasets.
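To make the three granularities concrete, the following is a minimal sketch of how a single audio-text similarity score could combine clip-sentence, segment-phrase, and frame-word comparisons. It is not the method of the paper: the abstract does not specify how segments and phrases are formed or how the levels are fused, so the fixed-window average pooling, the max-then-mean matching rule, the equal weighting, and all function and parameter names (`window_pool`, `max_mean_similarity`, `hierarchical_similarity`, `seg_size`, `phrase_size`) are assumptions chosen for illustration. Plain Python lists are used so the example stays self-contained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mean_pool(vectors):
    """Average a list of vectors into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def window_pool(vectors, size):
    """Average consecutive vectors in non-overlapping windows (the last window may be shorter)."""
    return [mean_pool(vectors[i:i + size]) for i in range(0, len(vectors), size)]


def max_mean_similarity(audio_units, text_units):
    """For each text unit take its best-matching audio unit, then average over text units."""
    best = [max(cosine(t, a) for a in audio_units) for t in text_units]
    return sum(best) / len(best)


def hierarchical_similarity(frames, words, seg_size=4, phrase_size=2):
    """Equal-weight average of three similarity levels (assumed fusion rule).

    frames: per-frame audio embeddings; words: per-word text embeddings,
    both assumed to live in the same embedding space.
    """
    # Coarse level: whole clip vs. whole sentence.
    clip_sentence = cosine(mean_pool(frames), mean_pool(words))
    # Middle level: pooled segments vs. pooled phrases.
    segment_phrase = max_mean_similarity(
        window_pool(frames, seg_size), window_pool(words, phrase_size)
    )
    # Fine level: individual frames vs. individual words.
    frame_word = max_mean_similarity(frames, words)
    return (clip_sentence + segment_phrase + frame_word) / 3.0
```

In a contrastive training setup, a score of this kind would be computed for every audio-caption pairing in a batch to fill the similarity matrix. The same function could in principle also score pairs formed from audio and captioner-generated text, although how the AC branch is actually wired is not described in the abstract.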