This paper explores grading the relevance of audio clips in text-based audio retrieval using crowdsourced assessments. Given a free-form text (e.g., a caption) as a query, crowdworkers are asked to grade audio clips with numeric scores between 0 and 100 that indicate how well the sound content of each clip matches the text, where 0 indicates no content match at all and 100 indicates a perfect content match. We integrate the crowdsourced relevances into training and evaluating text-based audio retrieval systems, and evaluate the effect of using them together with binary relevances from audio captioning. Conventionally, these binary relevances are defined by captioning-based audio-caption pairs: a pair is positive if the caption describes the paired audio, and all other pairs are negative. Experimental results indicate that there is no clear benefit from incorporating crowdsourced relevances alongside binary relevances when the crowdsourced relevances are binarized for contrastive learning. Instead, the results suggest that using only binary relevances defined by captioning-based audio-caption pairs is sufficient for contrastive learning.
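
As an illustration of what binarizing graded relevances for contrastive learning could look like, the following is a minimal sketch and not the paper's actual procedure. The function names, the threshold of 50, and the merging rule (a pair is positive if either source marks it positive) are all assumptions made for this example; the abstract does not specify them.

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (audio_id, caption_id); identifiers are hypothetical


def binarize_relevances(scores: Dict[Pair, float], threshold: float = 50.0) -> Dict[Pair, int]:
    """Map crowdsourced scores in [0, 100] to binary labels.

    The threshold value is an assumption for illustration only.
    """
    for pair, score in scores.items():
        if not 0 <= score <= 100:
            raise ValueError(f"score for {pair} is outside [0, 100]: {score}")
    return {pair: int(score >= threshold) for pair, score in scores.items()}


def merge_relevances(caption_pairs: Iterable[Pair], crowd_binary: Dict[Pair, int]) -> Dict[Pair, int]:
    """Combine captioning-based positives with binarized crowdsourced labels.

    Assumed rule: a pair is positive if either source marks it positive.
    Pairs absent from the result are treated as negatives.
    """
    merged = {pair: 1 for pair in caption_pairs}
    for pair, label in crowd_binary.items():
        merged[pair] = max(merged.get(pair, 0), label)
    return merged
```

In this sketch, using only the captioning-based pairs corresponds to skipping `merge_relevances` and treating every original audio-caption pair as the sole positive for its caption.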