In recent years, self-supervised learning has proven remarkably effective at learning robust feature representations from unlabelled data. Networks pretrained through self-supervision serve as effective feature extractors for downstream tasks, including few-shot learning. While the evaluation of unsupervised approaches to few-shot learning is well established in the image domain, it is notably absent in acoustics. This study addresses this gap by assessing the performance of large-scale self-supervised models in few-shot audio classification. Additionally, we explore the relationship between a model's few-shot learning capability and its performance on other downstream task benchmarks. Our findings reveal state-of-the-art performance on some few-shot problems, such as SpeechCommandsv2, as well as strong correlations between speech-based few-shot problems and various downstream audio tasks.