Despite being the best-known objective for learning speech representations, the HuBERT objective has seen little further development or improvement. We argue that the lack of an underlying principle has stalled this development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Owing to its generality, our formulation opens opportunities to improve parameterization and optimization, and we present two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improvement in pre-training yields significant gains on four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting the importance of the predictive coding interpretation.