Spoken Language Understanding (SLU) is the task of extracting semantic information from spoken utterances. Previous research has made progress in end-to-end SLU by exploiting paired speech-text data, for example through pre-trained Automatic Speech Recognition (ASR) models or by using paired text as intermediate targets. However, acquiring paired transcripts is expensive, and it is impractical for unwritten languages. Textless SLU, in contrast, extracts semantic information from speech without using paired transcripts, but the absence of intermediate targets and training guidance often results in suboptimal performance. In this work, inspired by the content-disentangled discrete units produced by self-supervised speech models, we propose using discrete units as intermediate guidance to improve textless SLU performance. Our method outperforms the baseline on five SLU benchmark corpora. Additionally, we find that unit guidance facilitates few-shot learning and improves the model's robustness to noise.
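To make the notion of discrete units concrete, the following is a minimal sketch of one common way such units are derived: each frame-level feature vector is assigned to its nearest entry in a k-means codebook, and consecutive repeats are collapsed into a shorter unit sequence that could serve as an intermediate target. The abstract does not specify how the units are obtained or how they are used during training, so the codebook, the toy feature values, and the helper names `quantize` and `collapse_repeats` are all illustrative assumptions rather than the method described above.

```python
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(frames: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map every frame-level feature vector to a discrete unit ID."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive units into a single unit."""
    collapsed: List[int] = []
    for unit in units:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


# Toy data (assumed): a 3-entry codebook and five 2-dimensional frames.
# In practice the features would come from a self-supervised speech model
# and the codebook from k-means fitted on those features.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.0, 0.2], [0.9, 1.1], [1.0, 0.8], [0.1, 1.9]]

units = quantize(frames, centroids)        # per-frame unit IDs
unit_targets = collapse_repeats(units)     # shorter sequence usable as a target
print(units, unit_targets)
```

No transcripts are involved at any point in this sketch: the unit sequence is computed from the speech features alone, which is what makes it a candidate source of guidance in the textless setting.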