Although the dialogue systems technology competition (DSTC) has driven remarkable advances in dialogue systems, building a robust task-oriented dialogue system with a speech interface remains a key challenge. Most progress has been made on text-based dialogue systems, because datasets of written corpora are abundant while datasets of spoken dialogues are very scarce. However, as voice assistants such as Siri and Alexa show, transferring this success to spoken dialogues is of practical importance. In this paper, we describe our engineering effort in building a highly successful model that participated in the speech-aware dialogue systems technology challenge track of DSTC11. Our model consists of three major modules: (1) automatic speech recognition (ASR) error correction to bridge the gap between spoken and text utterances, (2) a text-based dialogue system (D3ST) that estimates slots and values using slot descriptions, and (3) post-processing to recover from errors in the estimated slot values. Our experiments show that an explicit ASR error correction module, post-processing, and data augmentation are all important for adapting a text-based dialogue state tracker to spoken dialogue corpora.
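The three-module design described in the abstract can be pictured as a simple chain of functions. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the `correct` and `tracker` callables, the `ontology` dictionary, and the use of fuzzy string matching for post-processing are all hypothetical stand-ins, since the abstract does not say how each module is realized.

```python
from difflib import get_close_matches
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not taken from the paper):
#   Corrector: noisy ASR transcript -> corrected text
#   Tracker:   (text, slot descriptions) -> {slot name: predicted value}
Corrector = Callable[[str], str]
Tracker = Callable[[str, Dict[str, str]], Dict[str, str]]


def postprocess(
    state: Dict[str, str],
    ontology: Dict[str, List[str]],
    cutoff: float = 0.6,
) -> Dict[str, str]:
    """Snap each predicted value to its closest known value, if one exists.

    Fuzzy matching is an assumed stand-in for the post-processing module.
    """
    fixed = {}
    for slot, value in state.items():
        candidates = ontology.get(slot, [])
        match = get_close_matches(value, candidates, n=1, cutoff=cutoff)
        # Keep the raw prediction when no candidate is similar enough.
        fixed[slot] = match[0] if match else value
    return fixed


def track(
    asr_text: str,
    correct: Corrector,
    tracker: Tracker,
    slot_descriptions: Dict[str, str],
    ontology: Dict[str, List[str]],
) -> Dict[str, str]:
    """Run the three stages in order: correction, tracking, post-processing."""
    corrected = correct(asr_text)                      # module (1)
    raw_state = tracker(corrected, slot_descriptions)  # module (2)
    return postprocess(raw_state, ontology)            # module (3)
```

In practice the first two callables would wrap trained models; keeping them as plain function arguments makes it easy to swap any stage out and measure its individual contribution.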