Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and have proved extremely useful in low-label-resource settings. However, this success has yet to transfer to utterance-level tasks such as speaker, emotion, and language recognition, which still require supervised fine-tuning of the SSL models to obtain good performance. We argue that the problem is caused by the lack of disentangled representations and of an utterance-level learning objective for these tasks. Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. The underlying utterance-level representations are disentangled from the content of speech using probabilistic inference on the aligned features. Furthermore, the variational lower bound derived from the FA model provides an utterance-level objective, allowing error gradients to be backpropagated to the Transformer layers to learn highly discriminative acoustic units. When used in conjunction with HuBERT's masked prediction training, our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of the labeled data.
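The alignment and inference steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the model equations, so everything below is an assumption: it uses a classic i-vector-style factor analysis in which each frame is hard-assigned to one acoustic unit, each unit has its own mean, loading matrix, and diagonal noise variance, and the utterance-level latent vector has a standard normal prior. The function name `infer_utterance_embedding` and all parameter names are hypothetical, and the paper's actual formulation (including its variational lower bound and training procedure) may differ.

```python
import numpy as np


def infer_utterance_embedding(feats, units, means, loadings, variances):
    """Posterior mean and covariance of an utterance-level latent vector w.

    Assumed model (i-vector-style; NOT taken from the paper):
        x_t = means[k] + loadings[k] @ w + noise,   with k = units[t]
        w ~ N(0, I),   noise ~ N(0, diag(variances[k]))

    feats:     (T, D) frame-level SSL features
    units:     (T,)   integer acoustic-unit index per frame (e.g. from k-means)
    means:     (K, D) per-unit feature means
    loadings:  (K, D, R) per-unit factor loading matrices
    variances: (K, D) per-unit diagonal noise variances
    """
    n_units, _, latent_dim = loadings.shape
    precision = np.eye(latent_dim)   # prior precision of w
    linear = np.zeros(latent_dim)    # accumulates T_k^T Sigma_k^{-1} F_k

    for k in range(n_units):
        frames = feats[units == k]   # frames aligned to unit k
        if len(frames) == 0:
            continue
        count = frames.shape[0]                   # zeroth-order statistic
        first = (frames - means[k]).sum(axis=0)   # centered first-order statistic
        weighted = loadings[k].T / variances[k]   # T_k^T Sigma_k^{-1}, shape (R, D)
        precision += count * (weighted @ loadings[k])
        linear += weighted @ first

    cov = np.linalg.inv(precision)
    return cov @ linear, cov
```

In this sketch the per-unit centering removes content-dependent variation, so the posterior mean acts as a content-normalized utterance embedding. With no frames at all, the function returns the prior (zero mean, identity covariance).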