In this paper, we offer a new perspective on self-supervised speech models based on how the self-training targets are obtained. We generalize the target extractor into two types: the Offline Targets Extractor (Off-TE) and the Online Targets Extractor (On-TE). Building on this view, we propose MT4SSL, a new multi-task learning framework for self-supervised learning whose name stands for Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets. MT4SSL draws on two representative models, HuBERT and data2vec, which use the K-means algorithm as an Off-TE and a gradient-free teacher network as an On-TE, respectively. Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models while requiring less data. Furthermore, we find that using both Off-TE and On-TE leads to better convergence in the pre-training phase. Given its effectiveness and efficiency, we believe that multi-task learning on self-supervised speech models, viewed from this perspective, is a promising direction.
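To make the two kinds of targets concrete, the following is a minimal NumPy sketch of how an offline target (K-means cluster IDs, as in HuBERT) and an online target (the output of a gradient-free teacher, as in data2vec) might be combined into one training loss. It is an illustration under stated assumptions, not the MT4SSL implementation: the function names, the mean-squared-error regression term, the `weight` parameter, and the exponential-moving-average teacher update with `decay` are all assumptions introduced here for illustration.

```python
import numpy as np


def ema_update(teacher, student, decay=0.999):
    """On-TE side (assumed): teacher weights follow the student by an
    exponential moving average, so no gradients flow through the teacher."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def assign_clusters(features, centroids):
    """Off-TE side (assumed): map each frame to the ID of its nearest
    centroid in a K-means codebook that was fitted before training."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def cross_entropy(logits, targets):
    """Mean cross-entropy between per-frame logits and integer class targets."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def multi_target_loss(logits, student_repr, teacher_repr, cluster_ids, mask,
                      weight=1.0):
    """Hypothetical multi-task objective over masked frames only:
    classification against offline cluster IDs plus regression onto the
    online teacher representations. MSE and `weight` are assumptions."""
    off_loss = cross_entropy(logits[mask], cluster_ids[mask])
    on_loss = ((student_repr[mask] - teacher_repr[mask]) ** 2).mean()
    return off_loss + weight * on_loss
```

In this sketch, the offline targets are computed once from a fixed codebook, whereas the online targets change throughout training as the teacher tracks the student; the combined loss simply sums the two terms over the masked frames.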