It is challenging to extract semantic meaning directly from audio signals in spoken language understanding (SLU) because of the lack of textual information. Popular end-to-end (E2E) SLU models use sequence-to-sequence automatic speech recognition (ASR) models to extract textual embeddings as input for inferring semantics, but these models require computationally expensive auto-regressive decoding. In this work, we leverage self-supervised acoustic encoders fine-tuned with Connectionist Temporal Classification (CTC) to extract textual embeddings, and we use joint CTC and SLU losses for utterance-level SLU tasks. Experiments show that our model achieves a 4% absolute improvement over the state-of-the-art (SOTA) dialogue act classification model on the DSTC2 dataset and a 1.3% absolute improvement over the SOTA SLU model on the SLURP dataset.
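To make the idea of a joint CTC and SLU objective concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the two losses are combined as a weighted sum with a hypothetical weight `ctc_weight`, and that the utterance-level SLU loss is a cross-entropy over class logits; the abstract does not state either detail. The function names (`ctc_loss`, `cross_entropy`, `joint_loss`) are illustrative, and a real system would use a deep learning framework's CTC loss rather than this hand-written forward algorithm.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_softmax(logits):
    """Log-probabilities for one vector of unnormalized scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def ctc_loss(frame_logits, target, blank=0):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    frame_logits: list of per-frame score vectors over the token vocabulary.
    target: list of token ids (must not contain `blank`).
    """
    log_probs = [log_softmax(f) for f in frame_logits]
    # Extended target with blanks interleaved: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    S = len(ext)
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            acc = alpha[s]
            if s >= 1:
                acc = log_add(acc, alpha[s - 1])
            # Skipping the intermediate blank is allowed only between distinct labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                acc = log_add(acc, alpha[s - 2])
            if acc != NEG_INF:
                new[s] = acc + log_probs[t][ext[s]]
        alpha = new
    total = alpha[S - 1]
    if S > 1:
        total = log_add(total, alpha[S - 2])
    return -total


def cross_entropy(class_logits, label):
    """Utterance-level classification loss (assumed form of the SLU loss)."""
    return -log_softmax(class_logits)[label]


def joint_loss(frame_logits, transcript, class_logits, label, ctc_weight=0.5):
    """Assumed combination: weighted sum of the CTC loss and the SLU loss."""
    return (ctc_weight * ctc_loss(frame_logits, transcript)
            + (1.0 - ctc_weight) * cross_entropy(class_logits, label))


# Toy usage: 2 frames over a 3-token vocabulary (id 0 is blank), 4 SLU classes.
frames = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = joint_loss(frames, [1], [0.0, 0.0, 0.0, 0.0], label=2)
```

In this sketch, `ctc_weight` trades off how strongly the encoder is pushed toward transcript prediction versus the utterance-level label; the value 0.5 is arbitrary.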