Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio, and vice versa. While self-supervised learning (SSL) offers an alternative source of audio representations, scaling both model and dataset size for SSL-based general audio classification remains underexplored. We introduce Dasheng, a simple SSL audio encoder based on the efficient masked autoencoder framework. Trained with 1.2 billion parameters on 272,356 hours of diverse audio, Dasheng obtains significant performance gains on the HEAR benchmark: it outperforms previous work on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and is competitive on music and environmental sound classification. Nearest-neighbor classification experiments show that Dasheng features inherently contain rich speech, music, and environmental information. Code is available at https://github.com/richermans/dasheng/.