Recently, deep end-to-end learning has been studied for intent classification in Spoken Language Understanding (SLU). However, end-to-end models require large amounts of speech data with intent labels, and highly optimized models are generally sensitive to inconsistencies between training and evaluation conditions. Therefore, a natural language understanding approach based on Automatic Speech Recognition (ASR) remains attractive because it can utilize a pre-trained general language model and adapt to mismatches in the speech input environment. Using this module-based approach, we improve a noisy-channel model to handle transcription inconsistencies caused by ASR errors. We propose a two-stage method, Contrastive and Consistency Learning (CCL), that correlates error patterns between clean and noisy ASR transcripts and emphasizes the consistency of the latent features of the two transcripts. Experiments on four benchmark datasets show that CCL outperforms existing methods and improves ASR robustness in various noisy environments. Code is available at https://github.com/syoung7388/CCL.
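The abstract describes pairing clean and noisy ASR transcripts with a contrastive objective plus a consistency term on their latent features. The following is a minimal sketch of what such a combined objective could look like, assuming paired sentence embeddings are already available; the function name `ccl_losses`, the InfoNCE formulation, and the MSE consistency term are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ccl_losses(clean_emb, noisy_emb, temperature=0.1):
    """Sketch of a contrastive + consistency objective for paired
    clean/noisy ASR transcript embeddings of shape (batch, dim).
    Hypothetical illustration; not the authors' exact method."""
    # L2-normalize so dot products become cosine similarities.
    c = F.normalize(clean_emb, dim=-1)
    n = F.normalize(noisy_emb, dim=-1)
    # InfoNCE-style contrastive loss: each noisy embedding should
    # match its paired clean embedding against in-batch negatives.
    logits = n @ c.t() / temperature          # (batch, batch)
    targets = torch.arange(c.size(0))
    contrastive = F.cross_entropy(logits, targets)
    # Consistency loss: pull the two latent features together directly.
    consistency = F.mse_loss(noisy_emb, clean_emb)
    return contrastive, consistency

# Toy usage with random embeddings; noisy = clean + small perturbation.
clean = torch.randn(8, 64)
noisy = clean + 0.1 * torch.randn(8, 64)
l_contrast, l_consist = ccl_losses(clean, noisy)
total = l_contrast + l_consist
```

In a real pipeline the two embeddings would come from the same encoder applied to the reference transcript and the ASR hypothesis, and the two terms would typically be weighted before summing.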