Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and have proved extremely useful in settings with few labeled resources. However, the success of SSL models has yet to transfer to utterance-level tasks such as speaker, emotion, and language recognition, which still require supervised fine-tuning of the SSL models to obtain good performance. We argue that the problem is caused by the lack of disentangled representations and of an utterance-level learning objective for these tasks. Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. The underlying utterance-level representations are disentangled from the content of speech using probabilistic inference on the aligned features. Furthermore, the variational lower bound derived from the FA model provides an utterance-level objective, allowing error gradients to be backpropagated to the Transformer layers to learn highly discriminative acoustic units. When used in conjunction with HuBERT's masked prediction training, our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark using only 20% of the labeled data.
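To make the alignment and inference steps concrete, the following is a minimal NumPy sketch of a classical i-vector-style factor analysis, which is one standard way to infer an utterance-level latent vector from frame features that have been softly aligned to a set of acoustic units. It is an illustration under stated assumptions, not the formulation used in this work: the distance-based soft alignment, the diagonal covariances, the standard-normal prior, and all names (`soft_align`, `utterance_posterior`, `centroids`, `loadings`, `variances`, `temperature`) are hypothetical.

```python
import numpy as np


def soft_align(feats, centroids, temperature=1.0):
    """Softly assign each frame to acoustic units (assumed: softmax over
    negative squared distances to cluster centroids).

    feats: (T, D) frame-level features; centroids: (K, D).
    Returns posteriors of shape (T, K) whose rows sum to 1.
    """
    d2 = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(logits)
    return post / post.sum(axis=1, keepdims=True)


def utterance_posterior(feats, centroids, loadings, variances, temperature=1.0):
    """Gaussian posterior of an utterance-level latent vector w, assuming
    x_t | unit k ~ N(centroids[k] + loadings[k] @ w, diag(variances[k]))
    and a standard-normal prior on w (classical i-vector model).

    loadings: (K, D, R); variances: (K, D).
    Returns (posterior mean of shape (R,), posterior precision of shape (R, R)).
    """
    gamma = soft_align(feats, centroids, temperature)       # (T, K)
    n = gamma.sum(axis=0)                                    # zero-order stats, (K,)
    f = gamma.T @ feats - n[:, None] * centroids             # centered first-order stats, (K, D)

    rank = loadings.shape[2]
    scaled = loadings / variances[:, :, None]                # Sigma_k^{-1} T_k, (K, D, R)
    # Precision: I + sum_k n_k * T_k^T Sigma_k^{-1} T_k
    precision = np.eye(rank) + np.einsum("k,kdr,kds->rs", n, loadings, scaled)
    # Right-hand side: sum_k T_k^T Sigma_k^{-1} f_k
    rhs = np.einsum("kdr,kd->r", scaled, f)
    mean = np.linalg.solve(precision, rhs)
    return mean, precision


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(50, 8))           # stand-in for SSL frame features
    centroids = rng.normal(size=(4, 8))        # stand-in for discovered units
    loadings = 0.1 * rng.normal(size=(4, 8, 3))
    variances = np.ones((4, 8))
    mean, precision = utterance_posterior(feats, centroids, loadings, variances)
    print(mean.shape, precision.shape)
```

In this sketch the alignment pools frames by acoustic unit, so the inferred vector summarizes how the utterance deviates from each unit's centroid rather than which units were spoken; this is the sense in which such a model separates utterance-level factors from content.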