Limited diversity in standardized benchmarks for evaluating audio representation learning (ARL) methods may hinder systematic comparison of current methods' capabilities. We present ARCH, a comprehensive benchmark for evaluating ARL methods on diverse audio classification domains, covering acoustic events, music, and speech. ARCH comprises 12 datasets that allow us to thoroughly assess pre-trained self-supervised learning (SSL) models of different sizes. ARCH streamlines the benchmarking of ARL techniques through its unified access to a wide range of domains and its ability to readily incorporate new datasets and models. To address the current lack of open-source, pre-trained models for non-speech audio, we also release new pre-trained models that demonstrate strong performance on non-speech datasets. We argue that this wide-ranging evaluation provides valuable insights into state-of-the-art ARL methods and is useful for pinpointing promising research directions.
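A common way benchmarks of this kind assess frozen pre-trained representations is linear probing: extract one embedding per audio clip from the frozen SSL model, then fit a lightweight linear classifier per dataset. The sketch below illustrates that evaluation pattern on synthetic embeddings; all array shapes and data are illustrative assumptions, not ARCH's actual API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in for frozen embeddings: one fixed-size vector per audio clip,
# as would be produced by mean-pooling a pre-trained SSL model's outputs.
rng = np.random.default_rng(0)
n_train, n_test, dim, n_classes = 200, 50, 32, 4
X_train = rng.normal(size=(n_train, dim))
y_train = rng.integers(0, n_classes, size=n_train)
X_test = rng.normal(size=(n_test, dim))
y_test = rng.integers(0, n_classes, size=n_test)

# Linear probe: only this classifier is trained; the encoder stays frozen.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, probe.predict(X_test))
print(f"linear-probe accuracy: {accuracy:.3f}")
```

Because the probe is linear, its accuracy reflects how linearly separable the classes are in the embedding space, making it a cheap, comparable proxy for representation quality across many datasets.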