We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various potentially useful applications, no established benchmark with real-world data exists. The initial study of AMR trained models solely on synthetic datasets, and its evaluation relied on an annotated dataset of fewer than 100 samples, making the reported performance less reliable. To ensure performance in real-world environments, we present CASTELLA, a large-scale, manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for the training, validation, and test splits, respectively, making it 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on synthetic data outperforms a model trained solely on synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at https://h-munakata.github.io/CASTELLA-demo/.
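As context for the reported metric, the following is a minimal sketch of how Recall1@0.7 is conventionally computed in moment-retrieval benchmarks: the top-1 predicted temporal span counts as correct when its intersection-over-union (IoU) with the ground-truth span is at least 0.7. The function names and the interval representation here are illustrative assumptions, not the paper's actual evaluation code.

```python
def temporal_iou(pred, gt):
    """IoU of two 1-D temporal spans, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall1_at(preds, gts, threshold=0.7):
    """Fraction of samples whose top-1 predicted span reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# Toy example: two audio clips, one top-1 prediction each.
preds = [(0.0, 10.0), (2.0, 8.0)]   # predicted (start, end) moments
gts   = [(0.0, 10.0), (4.0, 10.0)]  # annotated ground-truth moments
print(recall1_at(preds, gts))       # first prediction matches exactly; second has IoU 0.5
```

A reported gain of "10.4 points in Recall1@0.7" thus means 10.4 percentage points more test clips whose top-ranked retrieved moment overlaps the annotation at IoU ≥ 0.7.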