Recent years have witnessed significant advances in self-supervised learning (SSL) methods for speech processing. A variety of speech-based SSL models have been developed and show promising performance on a range of downstream tasks, including speech recognition. However, existing speech-based SSL models share a common problem: their high computational cost, which may hinder both practical application and in-depth academic research. To address this issue, we first analyze the computational cost of the different modules during HuBERT pre-training and then introduce a stack of efficiency optimizations, which we call Fast-HuBERT. Fast-HuBERT can be trained in 1.1 days on 8 V100 GPUs on the LibriSpeech 960h benchmark without performance degradation, a 5.2x speedup over the original implementation. Moreover, we explore two well-studied techniques in Fast-HuBERT and observe improvements consistent with those reported in previous work.
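To put the reported figures in perspective, the arithmetic behind them can be sketched as follows. This is only a back-of-the-envelope calculation: it assumes the 5.2x speedup refers to wall-clock pre-training time on the same 8-GPU setup, which the abstract does not state explicitly. The baseline duration computed here is therefore implied rather than reported, and the function names are illustrative.

```python
def implied_baseline_days(fast_days: float, speedup: float) -> float:
    """Baseline wall-clock time implied by a speedup factor (assumption: same hardware)."""
    return fast_days * speedup


def gpu_hours(days: float, num_gpus: int) -> float:
    """Total GPU-hours for a run lasting `days` on `num_gpus` devices."""
    return days * 24 * num_gpus


# Figures taken from the abstract.
FAST_DAYS = 1.1   # Fast-HuBERT pre-training time, in days
SPEEDUP = 5.2     # reported speedup over the original implementation
NUM_GPUS = 8      # V100 GPUs

# Derived quantities (implied by the assumption above, not reported in the abstract).
baseline_days = implied_baseline_days(FAST_DAYS, SPEEDUP)
fast_gpu_hours = gpu_hours(FAST_DAYS, NUM_GPUS)
baseline_gpu_hours = gpu_hours(baseline_days, NUM_GPUS)

print(f"Implied baseline: {baseline_days:.2f} days")
print(f"Fast-HuBERT:      {fast_gpu_hours:.1f} GPU-hours")
print(f"Implied baseline: {baseline_gpu_hours:.1f} GPU-hours")
```

Under this assumption, the original implementation would need roughly 5.7 days on the same hardware, so the optimizations would save on the order of 900 GPU-hours per pre-training run.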