We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations. DECAR is clustering-based: an offline clustering step produces pseudo-labels that serve as targets for a prediction task. Building on recent advances in self-supervised learning for computer vision, we design a lightweight, easy-to-use pre-training scheme. We pre-train DECAR embeddings on a balanced subset of the large-scale AudioSet dataset and transfer the learned representations to nine downstream classification tasks covering speech, music, animal sounds, and acoustic scenes. We also conduct ablation studies that identify key design choices, and we make all our code and pre-trained models publicly available.
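The offline clustering step can be illustrated with a small sketch. This is not the DECAR implementation: the abstract does not name the clustering algorithm, so plain k-means is assumed here, and the function names (`kmeans_pseudo_labels`, `nearest_centroid`) and the toy 2-D "embeddings" are hypothetical. The sketch shows only how cluster assignments can be turned into pseudo-labels for a later prediction task.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))


def kmeans_pseudo_labels(embeddings, num_clusters, num_iters=20, seed=0):
    """Cluster embeddings with k-means and return (pseudo_labels, centroids).

    Assumption: k-means stands in for whatever offline clustering
    the actual method uses.
    """
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen data points.
    centroids = [list(p) for p in rng.sample(embeddings, num_clusters)]
    for _ in range(num_iters):
        labels = [nearest_centroid(p, centroids) for p in embeddings]
        for k in range(num_clusters):
            members = [p for p, lab in zip(embeddings, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                dim = len(members[0])
                centroids[k] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    # Final assignment against the converged centroids.
    labels = [nearest_centroid(p, centroids) for p in embeddings]
    return labels, centroids


# Toy stand-in for encoder outputs: two well-separated 2-D blobs.
data_rng = random.Random(1)
blob_a = [[data_rng.gauss(0.0, 0.1), data_rng.gauss(0.0, 0.1)] for _ in range(20)]
blob_b = [[data_rng.gauss(5.0, 0.1), data_rng.gauss(5.0, 0.1)] for _ in range(20)]
embeddings = blob_a + blob_b

pseudo_labels, centroids = kmeans_pseudo_labels(embeddings, num_clusters=2)
print(pseudo_labels)
```

In a full pipeline of this kind, the resulting `pseudo_labels` would serve as classification targets for the encoder (for example with a cross-entropy loss), and clustering would typically be repeated as the embeddings change; those training details are outside this sketch.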