In this paper, we offer a new perspective on self-supervised speech models based on how their training targets are obtained. We generalize the target extractor into two types: the Offline Targets Extractor (Off-TE) and the Online Targets Extractor (On-TE). Building on this distinction, we propose MT4SSL, a new multi-task learning framework for self-supervised learning whose name stands for Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets. MT4SSL uses the K-means algorithm as an Off-TE and a teacher network without gradients as an On-TE. Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models while using less data. Furthermore, we find that using both Off-TE and On-TE leads to better convergence in the pre-training phase. Given its effectiveness and efficiency, we believe that multi-task learning on self-supervised speech models from this perspective is a promising direction.
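As a rough illustration of how the two kinds of targets could be combined into one objective, the sketch below pairs a discrete target from K-means centroids (the Off-TE role) with a continuous target from a gradient-free teacher (the On-TE role). It is a minimal sketch under stated assumptions, not the MT4SSL implementation: the abstract does not specify how the teacher is updated or which losses are used, so the exponential-moving-average update, the cross-entropy and mean-squared-error terms, the `weight` parameter, and all function names here are hypothetical choices made for illustration.

```python
import math


def nearest_centroid(frame, centroids):
    """Off-TE stand-in: index of the closest centroid.

    Assumes the centroids were fitted offline by K-means beforehand.
    """
    dists = [sum((f - c) ** 2 for f, c in zip(frame, cen)) for cen in centroids]
    return dists.index(min(dists))


def ema_update(teacher, student, decay):
    """On-TE stand-in: teacher weights track the student without gradients.

    The exponential moving average is an assumption, not stated in the abstract.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_task_loss(logits, cluster_id, student_out, teacher_out, weight=1.0):
    """Offline (classification) term plus weighted online (regression) term."""
    return cross_entropy(logits, cluster_id) + weight * mse(student_out, teacher_out)


# Toy usage with made-up numbers: one 2-dimensional frame, two centroids.
centroids = [[0.0, 0.0], [1.0, 1.0]]
cluster_id = nearest_centroid([0.9, 0.8], centroids)
teacher = ema_update([1.0, 1.0], [0.0, 2.0], decay=0.9)
loss = multi_task_loss([0.0, 0.0], cluster_id, [0.5, 0.5], [0.5, 0.5])
```

In this toy setup the offline term is a classification problem over cluster indices, while the online term is a regression onto teacher outputs that are treated as constants. The relative weighting of the two terms is a free design choice in this sketch.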