There has been increasing interest in integrating pretrained automatic speech recognition (ASR) models and language models (LMs) into the spoken language understanding (SLU) framework. However, prior methods often struggle with a vocabulary mismatch between the pretrained models, and the LM cannot be used directly because the SLU setup diverges from its NLU formulation. In this study, we propose a three-pass end-to-end (E2E) SLU system that effectively integrates ASR and LM subnetworks into the SLU formulation for sequence generation tasks. In the first pass, the ASR subnetwork predicts the ASR transcript. In the second pass, the LM subnetwork makes an initial SLU prediction. Finally, in the third pass, the deliberation subnetwork conditions on representations from both the ASR and LM subnetworks to make the final prediction. The proposed three-pass SLU system outperforms cascaded and E2E SLU models on two benchmark SLU datasets, SLURP and SLUE, especially on acoustically challenging utterances.
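The data flow of the three passes described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the names `PassOutput`, `three_pass_slu`, `toy_asr`, `toy_lm`, and `toy_deliberation` are hypothetical, and the toy subnetworks are hand-written placeholders standing in for the pretrained ASR, LM, and deliberation subnetworks.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class PassOutput:
    """Tokens predicted by one pass, plus one hidden vector per token."""
    tokens: List[str]
    hidden: List[Vector]


def three_pass_slu(
    speech: List[float],
    asr: Callable[[List[float]], PassOutput],
    lm: Callable[[List[str]], PassOutput],
    deliberation: Callable[[List[Vector], List[Vector]], List[str]],
) -> List[str]:
    """Chain the three passes; every subnetwork is passed in as a callable."""
    asr_out = asr(speech)        # pass 1: speech -> ASR transcript
    lm_out = lm(asr_out.tokens)  # pass 2: transcript -> initial SLU prediction
    # Pass 3: condition on the representations of BOTH earlier passes.
    return deliberation(asr_out.hidden, lm_out.hidden)


# Hypothetical toy stand-ins (assumptions for illustration only).
def toy_asr(speech: List[float]) -> PassOutput:
    tokens = ["turn", "on", "lights"]
    # Fake "hidden state": the token length as a 1-dimensional vector.
    return PassOutput(tokens, [[float(len(t))] for t in tokens])


def toy_lm(tokens: List[str]) -> PassOutput:
    labels = ["intent:" + tokens[0]] + list(tokens[1:])
    # Fake "hidden state": the token position as a 1-dimensional vector.
    return PassOutput(labels, [[float(i)] for i in range(len(labels))])


def toy_deliberation(asr_hidden: List[Vector], lm_hidden: List[Vector]) -> List[str]:
    # Concatenate the two representations per position, then apply an
    # arbitrary threshold to emit one label per position.
    fused = [a + l for a, l in zip(asr_hidden, lm_hidden)]
    return ["B-slot" if sum(vec) >= 4.0 else "O" for vec in fused]


final_prediction = three_pass_slu([0.0, 0.1, 0.2], toy_asr, toy_lm, toy_deliberation)
```

Passing the subnetworks in as callables keeps the sketch independent of any particular toolkit; in a real system each callable would wrap a trained model, and the third pass would typically attend over both representation sequences rather than apply a fixed rule.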