Recent progress in text-based audio retrieval has been largely propelled by the release of suitable datasets. Since creating such datasets manually is laborious, sourcing data from online resources can be a cheap way to build large-scale datasets. We study the recently proposed SoundDesc benchmark dataset, which was automatically sourced from the BBC Sound Effects web page. Our analysis shows that SoundDesc contains several duplicates that leak training data into the evaluation data. This data leakage leads to overly optimistic retrieval performance estimates in previous benchmarks. We propose new training, validation, and testing splits for the dataset and make them available online. To avoid weak contamination of the test data, we pool audio files that share similar recording setups. In our experiments, we find that the new splits serve as a more challenging benchmark.
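The two ideas in the abstract, removing duplicates and pooling files with similar recording setups, can be illustrated with a minimal sketch. It is not the authors' actual procedure. It assumes each recording is a dictionary with hypothetical `audio` (raw bytes) and `setup` (a recording-setup label) fields, treats only byte-identical audio as duplicates, and uses illustrative split ratios.

```python
import hashlib
import random
from collections import defaultdict


def deduplicate(records):
    """Keep the first record for each distinct audio content hash.

    Assumption: `record["audio"]` holds raw bytes; only exact
    byte-level duplicates are detected here.
    """
    seen = {}
    for record in records:
        key = hashlib.sha256(record["audio"]).hexdigest()
        seen.setdefault(key, record)
    return list(seen.values())


def group_split(records, ratios=(0.7, 0.1, 0.2), seed=0):
    """Assign whole pools to train/val/test so no pool spans two splits.

    Assumption: `record["setup"]` is a hypothetical string label that
    identifies recordings sharing a similar recording setup.
    """
    pools = defaultdict(list)
    for record in records:
        pools[record["setup"]].append(record)

    # Shuffle the pool labels deterministically for reproducibility.
    labels = sorted(pools)
    random.Random(seed).shuffle(labels)

    names = ["train", "val", "test"]
    targets = [ratio * len(records) for ratio in ratios]
    splits = {name: [] for name in names}

    current = 0
    for label in labels:
        # Move on to the next split once the current one reached its target.
        while current < len(names) - 1 and len(splits[names[current]]) >= targets[current]:
            current += 1
        splits[names[current]].extend(pools[label])
    return splits
```

Calling `group_split(deduplicate(records))` returns a dictionary with `train`, `val`, and `test` lists. Because whole pools are assigned at once, the resulting split sizes only approximate the requested ratios.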