Recent dialogue systems rely on turn-based spoken interactions, requiring accurate Automatic Speech Recognition (ASR). Errors in ASR can significantly impact downstream dialogue tasks. To address this, prior work has proposed using dialogue context from user–agent interactions when transcribing subsequent utterances. This method takes the transcription of the user's speech and the agent's response as model input, using the context accumulated over turns. However, this context is susceptible to ASR errors because it is generated by the ASR model in an auto-regressive fashion. Such noisy context can erode the benefits of context input, resulting in suboptimal ASR performance. In this paper, we introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. To maximize the advantage of context awareness, our approach combines decoder pre-training on text-based dialogue data with noise representation learning for a context encoder. In evaluations on spoken dialogues, our method outperforms baselines. Furthermore, the strength of our approach is most apparent in noisy environments where user speech is barely audible due to real-world noise: the model leverages contextual information to transcribe the input accurately.
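To make the noisy-context setting concrete, the sketch below simulates ASR-style errors in a dialogue context by randomly substituting or deleting words. This is an illustrative assumption about how noisy training context could be synthesized, not the paper's exact noising scheme; the error rates and the `<sub>` placeholder token are hypothetical. In CNRL, the context encoder would then be trained so that representations of such corrupted context stay close to those of the clean context.

```python
import random

def corrupt_context(words, sub_rate=0.1, del_rate=0.05, seed=0):
    """Simulate ASR errors in a dialogue context.

    Randomly deletes or substitutes words at illustrative rates;
    a seeded RNG keeps the corruption reproducible across runs.
    """
    rng = random.Random(seed)
    noisy = []
    for w in words:
        r = rng.random()
        if r < del_rate:
            continue                  # simulated deletion error
        elif r < del_rate + sub_rate:
            noisy.append("<sub>")     # simulated substitution error
        else:
            noisy.append(w)           # word recognized correctly
    return noisy

clean = "could you book a table for two at seven".split()
noisy = corrupt_context(clean)
```

Pairs of `clean` and `noisy` contexts produced this way could serve as training data for a noise-robust context encoder, e.g. by minimizing the distance between their encoded representations.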