Self-supervised representation learning (SSRL) methods have recently proven useful across a range of downstream tasks. Many of these models, exemplified by HuBERT and WavLM, are trained on pseudo-labels generated from spectral features or from the model's own representations. Previous studies have shown that these pseudo-labels contain semantic information. However, the masked prediction task, which is HuBERT's learning criterion, focuses on local contextual information and may not make effective use of global semantic information such as the speaker or the theme of the speech. In this paper, we propose a new approach to enrich the semantic representation of HuBERT. We apply a topic model to the pseudo-labels to generate a topic label for each utterance. We then add an auxiliary topic classification task to HuBERT, using the topic labels as teacher signals. This allows additional global semantic information to be incorporated in an unsupervised manner. Experimental results demonstrate that our method achieves performance comparable to or better than the baseline in most tasks, including automatic speech recognition and five of the eight SUPERB tasks. Moreover, we find that the topic labels capture various information about an utterance, such as gender, speaker, and theme. This highlights the effectiveness of our approach in capturing multifaceted semantic nuances.
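The topic-labeling step described above can be illustrated with a minimal sketch. It assumes that each utterance is treated as a bag of discrete pseudo-label IDs and that the topic model is latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling; the abstract does not name the topic model, so LDA, the function name `topic_labels`, and all hyperparameter values here are illustrative assumptions rather than the authors' implementation.

```python
import random


def topic_labels(utterances, num_topics, vocab_size,
                 iters=50, alpha=0.1, beta=0.01, seed=0):
    """Return one topic ID per utterance (hypothetical helper, not from the paper).

    utterances: list of lists of integer pseudo-label IDs in [0, vocab_size).
    The topic model is LDA with collapsed Gibbs sampling (an assumption).
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in utterances]          # counts per (utterance, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts per (topic, pseudo-label)
    topic_total = [0] * num_topics                              # total tokens per topic
    assign = []                                                 # topic of every token

    # Random initialisation of token-level topic assignments.
    for d, doc in enumerate(utterances):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)

    # Collapsed Gibbs sweeps: resample each token's topic given all the others.
    for _ in range(iters):
        for d, doc in enumerate(utterances):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # The utterance-level label is the topic that owns most of its tokens.
    return [max(range(num_topics), key=lambda k: counts[k]) for counts in doc_topic]


# Toy usage: four short "utterances" over a vocabulary of four pseudo-labels.
toy = [[0, 1, 0, 1, 0], [2, 3, 2, 3], [0, 0, 1], [3, 2, 2, 3, 3]]
print(topic_labels(toy, num_topics=2, vocab_size=4))
```

In a setup like the one described, labels of this kind would serve as targets for the auxiliary utterance-level classification task alongside masked prediction; how the two objectives are weighted is not stated in the abstract and is left out of the sketch.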