Self-supervised models have had great success in learning speech representations that generalize to a variety of downstream tasks. However, most self-supervised models require a large amount of compute and multiple GPUs to train, which significantly hampers the development of self-supervised learning. To reduce the computational cost of training, we revisit the training of HuBERT, a highly successful self-supervised model. We improve and simplify several key components, including the loss function, the input representation, and the multi-stage training procedure. Our model, MelHuBERT, achieves favorable performance compared with HuBERT on phone recognition, speaker identification, and automatic speech recognition, while saving 31.2% of the pre-training time, or equivalently 33.5% of the MACs per second of speech. The code and pre-trained models are available at https://github.com/nervjack2/MelHuBERT.