In Task-Oriented Dialogue (TOD) systems, correctly updating the system's understanding of the user's requests (\textit{a.k.a.} dialogue state tracking, DST) is key to a smooth interaction. Traditionally, TOD systems perform this update in three steps: transcription of the user's utterance, semantic extraction of the key concepts, and contextualization with the previously identified concepts. Such cascade approaches suffer from error propagation and from separate optimization of each component. End-to-end approaches have proven helpful up to the turn-level semantic extraction step. This paper goes one step further and provides (1) a novel approach for completely neural spoken DST, (2) an in-depth comparison with a state-of-the-art cascade approach, and (3) avenues towards better context propagation. Our study highlights that jointly optimized approaches are also competitive for contextually dependent tasks such as DST, especially in audio-native settings. Context propagation in DST systems could benefit from training procedures that account for the inherent uncertainty of the previous context.