Recent voice assistants are usually based on the cascade spoken language understanding (SLU) solution, which consists of an automatic speech recognition (ASR) engine followed by a natural language understanding (NLU) system. Because such an approach relies on the ASR output, it often suffers from so-called ASR error propagation. In this work, we investigate the impact of ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLMs), such as BERT and RoBERTa. Moreover, we propose a multimodal language understanding (MLU) module to mitigate the SLU performance degradation caused by errors in the ASR transcript. The MLU benefits from self-supervised features learned from both the audio and text modalities, specifically Wav2Vec for speech and BERT/RoBERTa for language. Our MLU combines an audio encoder that embeds the speech signal with a text encoder that processes the transcript, followed by a late fusion layer that fuses the audio and text logits. We find that the proposed MLU is robust to poor-quality ASR transcripts, whereas the performance of BERT and RoBERTa is severely compromised. Our model is evaluated on five tasks from three SLU datasets, and its robustness is tested using ASR transcripts from three ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing the performance of the PLM models across all datasets for the academic ASR engine.
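To make the late fusion step concrete, the following is a minimal sketch of fusing per-class logits from an audio branch and a text branch. It is an illustration under stated assumptions, not the implementation evaluated in this work: the fusion rule (a fixed weighted sum of logits followed by a softmax), the function name `late_fusion`, the parameter `audio_weight`, and the toy logit values are all hypothetical. A trained fusion layer would normally learn how to combine the two branches.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def late_fusion(audio_logits, text_logits, audio_weight=0.5):
    """Fuse class logits from two branches with a fixed weighted sum.

    ASSUMPTION: a fixed convex combination stands in for the fusion
    layer; it is a hypothetical choice made only for illustration.
    """
    audio_logits = np.asarray(audio_logits, dtype=float)
    text_logits = np.asarray(text_logits, dtype=float)
    if audio_logits.shape != text_logits.shape:
        raise ValueError("audio and text logits must have the same shape")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    fused = audio_weight * audio_logits + (1.0 - audio_weight) * text_logits
    return softmax(fused, axis=-1)


# Toy example (made-up numbers): one utterance, three intent classes.
# The text branch is misled by a hypothetical ASR error and prefers class 1,
# while the audio branch prefers class 0.
audio = [[2.0, 0.5, 0.1]]
text = [[0.2, 1.5, 0.1]]

fused_probs = late_fusion(audio, text, audio_weight=0.5)
fused_pred = fused_probs.argmax(axis=-1)

# Setting audio_weight to 0 reduces the model to the text branch alone.
text_only_pred = late_fusion(audio, text, audio_weight=0.0).argmax(axis=-1)
```

In this toy case the fused prediction follows the audio branch, while the text-only prediction follows the corrupted transcript, which is the kind of situation in which access to the audio signal can compensate for an ASR error.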