Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics and then drawing connections between the two modalities. This paper investigates a hybrid retrieval system that utilizes audio metadata as an additional clue to understand the content of audio signals before matching them with textual queries. We experimented with metadata often attached to audio recordings, such as keywords and natural-language descriptions, and we investigated late and mid-level fusion strategies to merge audio and metadata. Our hybrid approach with keyword metadata and late fusion improved the retrieval performance over a content-based baseline by 2.36 and 3.69 pp. mAP@10 on the ClothoV2 and AudioCaps benchmarks, respectively.
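The late-fusion strategy mentioned above can be sketched as a weighted combination of two similarity matrices: one from matching text queries against audio embeddings, and one from matching them against metadata embeddings. This is a minimal illustration under assumed conventions, not the authors' exact implementation; the embedding shapes and the weighting parameter `alpha` are assumptions.

```python
import numpy as np

def late_fusion_scores(audio_emb, meta_emb, text_emb, alpha=0.5):
    """Late fusion: blend cosine similarities of (query, audio) and
    (query, metadata) pairs with a mixing weight alpha (assumed here)."""
    def cosine(a, b):
        # Row-normalize, then take inner products -> cosine similarity matrix.
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T
    return alpha * cosine(text_emb, audio_emb) + (1 - alpha) * cosine(text_emb, meta_emb)

# Toy example: 2 text queries, 3 audio items, 4-dimensional embeddings.
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(3, 4))   # content-based audio embeddings
meta_emb = rng.normal(size=(3, 4))    # embeddings of keyword metadata
text_emb = rng.normal(size=(2, 4))    # query embeddings
scores = late_fusion_scores(audio_emb, meta_emb, text_emb)
ranking = np.argsort(-scores, axis=1)  # per-query ranking of audio items
```

Because fusion happens at the score level, the content and metadata branches can use entirely separate encoders, which is what makes late fusion simple to bolt onto an existing content-based retrieval system.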