We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various potentially useful applications, no established benchmark with real-world data exists. The initial study of AMR trained models solely on synthetic datasets, and its evaluation relied on an annotated dataset of fewer than 100 samples, making the reported performance less reliable. To ensure performance in real-world environments, we present CASTELLA, a large-scale, manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for the training, validation, and test splits, respectively, making it 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on synthetic data outperforms a model trained solely on synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at https://h-munakata.github.io/CASTELLA-demo/.
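As context for the reported metric, the following is a minimal sketch of how Recall1@0.7 is conventionally computed in moment-retrieval benchmarks: the top-1 predicted temporal span counts as correct when its intersection-over-union (IoU) with the ground-truth span is at least 0.7. The function names and the interval representation here are illustrative assumptions, not the paper's actual evaluation code.

```python
def temporal_iou(pred, gt):
    """IoU of two 1-D temporal spans, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall1_at(preds, gts, threshold=0.7):
    """Fraction of samples whose top-1 predicted span reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# Toy example: two audio clips, one top-1 prediction each.
preds = [(0.0, 10.0), (2.0, 8.0)]   # predicted (start, end) moments
gts   = [(0.0, 10.0), (4.0, 10.0)]  # annotated ground-truth moments
print(recall1_at(preds, gts))       # first prediction matches exactly; second has IoU 0.5
```

A reported gain of "10.4 points in Recall1@0.7" thus means 10.4 percentage points more test clips whose top-ranked retrieved moment overlaps the annotation at IoU ≥ 0.7.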