In recent years, self-supervised learning has excelled at learning robust feature representations from unlabelled data. Networks pretrained through self-supervision serve as effective feature extractors for downstream tasks, including few-shot learning. While the evaluation of unsupervised approaches for few-shot learning is well established for images, it is notably absent for audio. This study addresses this gap by assessing the performance of large-scale self-supervised models on few-shot audio classification. Additionally, we explore the relationship between a model's few-shot learning capability and its performance on other downstream task benchmarks. Our findings reveal state-of-the-art performance on some few-shot problems, such as SpeechCommandsv2, as well as strong correlations between speech-based few-shot problems and various downstream audio tasks.
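To make the "pretrained network as frozen feature extractor" setup concrete, here is a minimal sketch of one N-way K-shot evaluation episode. The abstract does not say which classifier sits on top of the features, so this sketch assumes a nearest-prototype (class-centroid) classifier with Euclidean distance, one common choice. `embed`, `run_episode`, and the toy data are hypothetical stand-ins: in practice `embed` would wrap a frozen self-supervised audio model and the clips would be real recordings.

```python
import math
import random
from typing import Callable, Dict, List, Sequence


def centroid(vectors: Sequence[List[float]]) -> List[float]:
    """Mean of a set of embeddings (the class 'prototype')."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def run_episode(
    embed: Callable[[object], List[float]],  # hypothetical frozen encoder
    data: Dict[str, List[object]],           # label -> list of clips
    n_way: int,
    k_shot: int,
    n_query: int,
    rng: random.Random,
) -> float:
    """Sample one episode and return query accuracy (assumed protocol)."""
    classes = rng.sample(sorted(data), n_way)
    prototypes: Dict[str, List[float]] = {}
    queries = []
    for label in classes:
        # Disjoint support and query clips for each sampled class.
        clips = rng.sample(data[label], k_shot + n_query)
        support, query = clips[:k_shot], clips[k_shot:]
        prototypes[label] = centroid([embed(x) for x in support])
        queries.extend((embed(x), label) for x in query)
    # Encoder stays frozen; only the prototypes depend on the support set.
    correct = 0
    for vec, label in queries:
        pred = min(prototypes, key=lambda c: euclidean(vec, prototypes[c]))
        correct += pred == label
    return correct / len(queries)


# Toy usage: identity "encoder" on well-separated 2-D points.
rng = random.Random(0)
toy = {
    "yes": [[1.0 + 0.01 * i, 0.0] for i in range(10)],
    "no": [[0.0, 1.0 + 0.01 * i] for i in range(10)],
    "up": [[-1.0 - 0.01 * i, 0.0] for i in range(10)],
}
acc = run_episode(lambda clip: clip, toy, n_way=3, k_shot=5, n_query=5, rng=rng)
print(acc)  # 1.0
```

Averaging this accuracy over many random episodes gives a single few-shot score per pretrained model, which could then be compared against that model's scores on other downstream benchmarks.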