Spoken Language Understanding (SLU) aims to extract semantic information from spoken utterances. Prior work has advanced end-to-end SLU by exploiting paired speech-text data, either through pre-trained Automatic Speech Recognition (ASR) models or by using paired text as intermediate targets. However, paired transcripts are expensive to acquire and impractical for unwritten languages. Textless SLU, in contrast, extracts semantic information from speech without paired transcripts, but the absence of intermediate targets and training guidance often leads to suboptimal performance. In this work, inspired by the content-disentangled discrete units produced by self-supervised speech models, we propose using discrete units as intermediate guidance to improve textless SLU performance. Our method surpasses the baseline on five SLU benchmark corpora. We also find that unit guidance facilitates few-shot learning and improves the model's robustness to noise.
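To make "discrete units" concrete, here is a minimal sketch of one common way such units are obtained: each frame-level feature vector is assigned to its nearest cluster centroid, and consecutive repeats are collapsed. The abstract does not state how the units are derived, so the nearest-centroid quantization, the function name `frames_to_units`, and the toy features and centroids below are all illustrative assumptions, not the authors' pipeline.

```python
from typing import List

import numpy as np


def frames_to_units(features: np.ndarray, centroids: np.ndarray) -> List[int]:
    """Quantize frame features to centroid indices and collapse repeats.

    features:  (T, D) array of frame-level features (hypothetical stand-in
               for self-supervised speech model outputs).
    centroids: (K, D) array of cluster centers (assumed to be learned
               offline, e.g. with k-means).
    """
    # Squared Euclidean distance from every frame to every centroid: (T, K).
    diffs = features[:, None, :] - centroids[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)

    # Nearest centroid index for each frame.
    ids = dists.argmin(axis=1)

    # Collapse runs of identical consecutive units.
    units: List[int] = []
    for u in ids:
        if not units or int(u) != units[-1]:
            units.append(int(u))
    return units


# Toy example: 6 frames, 2-dimensional features, 3 centroids.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
features = np.array(
    [[0.1, -0.1], [0.0, 0.2], [0.9, 1.1], [1.2, 0.8], [4.8, 5.1], [0.1, 0.0]]
)
units = frames_to_units(features, centroids)
print(units)  # [0, 1, 2, 0]
```

A unit sequence of this kind could then serve as an intermediate prediction target alongside the SLU label; how the guidance is actually applied is described in the paper itself, not in this sketch.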