Self-supervised learning (SSL) has recently shown remarkable results in closing the gap between supervised and unsupervised learning. The idea is to learn robust features that are invariant to distortions of the input data. Despite its success, this idea can suffer from representation collapse, where the network produces the same constant representation for every input. To address this, we introduce SELFIE, a novel Self-supervised Learning approach for audio representation via Feature Diversity and Decorrelation. SELFIE avoids collapse by ensuring that the representation (i) maintains high diversity among embeddings and (ii) has decorrelated dimensions. SELFIE is pre-trained on the large-scale AudioSet dataset, and its embeddings are evaluated on nine downstream audio tasks covering speech, music, and sound event recognition. Experimental results show that SELFIE outperforms existing SSL methods on several of these tasks.
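The two properties named in the abstract, diversity among embeddings and decorrelation between dimensions, can be illustrated with a minimal NumPy sketch. This is not SELFIE's actual objective, which the abstract does not specify. It assumes a generic variance hinge and an off-diagonal covariance penalty, and the function names `diversity_term` and `decorrelation_term`, the `target_std` threshold, and `eps` are all hypothetical choices made for illustration.

```python
import numpy as np


def diversity_term(z, target_std=1.0, eps=1e-4):
    """Hinge penalty on the per-dimension standard deviation across a batch.

    Assumed illustrative form: the penalty is large when embeddings barely
    vary across the batch (collapse) and zero once every dimension reaches
    target_std.
    """
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, target_std - std)))


def decorrelation_term(z):
    """Sum of squared off-diagonal covariance entries, divided by the dimension.

    Assumed illustrative form: the penalty is large when embedding dimensions
    carry redundant (correlated) information.
    """
    n, d = z.shape
    zc = z - z.mean(axis=0)                  # center each dimension
    cov = zc.T @ zc / (n - 1)                # d x d batch covariance
    off_diag = cov - np.diag(np.diag(cov))   # zero out the diagonal
    return float((off_diag ** 2).sum() / d)


rng = np.random.default_rng(0)
collapsed = np.ones((256, 8))                                     # constant output for every input
spread = rng.standard_normal((256, 8))                            # varied, independent dimensions
redundant = np.repeat(rng.standard_normal((256, 1)), 8, axis=1)   # identical dimensions
```

Under these assumptions, `diversity_term(collapsed)` is much larger than `diversity_term(spread)`, and `decorrelation_term(redundant)` is much larger than `decorrelation_term(spread)`. Minimizing both terms therefore pushes a batch of embeddings away from the constant or redundant solutions that the abstract describes as collapse.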