In this paper, we introduce UnFuSeD, a novel approach that leverages self-supervised learning (SSL) to reduce the need for large amounts of labeled data in audio classification. Unlike prior work, which directly fine-tunes a self-supervised pre-trained encoder on a target dataset, we use the encoder to generate pseudo-labels for unsupervised fine-tuning before the actual fine-tuning step. We first train an encoder with a novel SSL algorithm on an unlabeled audio dataset. We then use that encoder to generate pseudo-labels on the target task dataset by clustering the extracted representations. These pseudo-labels guide self-distillation on a randomly initialized model, a step we call unsupervised fine-tuning. Finally, the resulting encoder is fine-tuned on the target task dataset. With UnFuSeD, we propose the first system that moves away from the generic SSL paradigm in the literature, which pre-trains and fine-tunes the same encoder, and present a novel self-distillation-based system that leverages SSL pre-training for low-resource audio classification. In practice, UnFuSeD achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming all our baselines, and does so with 40% fewer parameters than the previous state-of-the-art system. We make all our code publicly available.
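The pseudo-labeling step described above, clustering encoder representations of the target dataset, can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, so plain k-means with a deterministic farthest-first initialization is assumed here purely for illustration. The function names `farthest_first_init` and `kmeans_pseudo_labels` are hypothetical, and the `features` array stands in for representations that a pre-trained encoder would produce.

```python
import numpy as np


def farthest_first_init(features, num_clusters):
    """Pick initial centroids deterministically (assumed initialization, not from the paper)."""
    centroids = [features[0]]
    for _ in range(1, num_clusters):
        # Squared distance from each sample to its nearest already-chosen centroid.
        dists = np.min([((features - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(features[dists.argmax()])
    return np.stack(centroids).astype(float)


def assign(features, centroids):
    """Return the index of the nearest centroid for every feature vector."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans_pseudo_labels(features, num_clusters, num_iters=50):
    """Cluster feature vectors with k-means; the cluster ids serve as pseudo-labels."""
    centroids = farthest_first_init(features, num_clusters)
    for _ in range(num_iters):
        labels = assign(features, centroids)
        new_centroids = centroids.copy()
        for k in range(num_clusters):
            members = features[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment so that labels match the returned centroids.
    return assign(features, centroids), centroids


# Stand-in for encoder outputs: 40 clips with 8-dimensional representations.
rng = np.random.default_rng(0)
features = np.concatenate([
    rng.normal(0.0, 0.1, size=(20, 8)),
    rng.normal(10.0, 0.1, size=(20, 8)),
])
pseudo_labels, _ = kmeans_pseudo_labels(features, num_clusters=2)
```

In the pipeline described above, cluster ids of this kind would then serve as targets for the self-distillation step on a randomly initialized model. The number of clusters and the feature dimension used here are arbitrary choices, not values from the paper.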