Most human interactions take the form of spoken conversations, in which the meaning of an utterance depends on its context. Each utterance in a spoken conversation can be described by many semantic and speaker attributes, and there has been interest in building Spoken Language Understanding (SLU) systems that predict these attributes automatically. Recent work has shown that incorporating dialogue history can improve SLU performance. However, a separate model is used for each SLU task, which increases inference time and computation cost. Motivated by this, we ask: can we jointly model all the SLU tasks while incorporating context, so as to enable low-latency and lightweight inference? To answer this question, we propose a novel model architecture that learns dialogue context to jointly predict the intent, dialogue act, speaker role, and emotion of a spoken utterance. Because our joint prediction is based on an autoregressive model, we must decide the order in which the dialogue attributes are predicted, which is not trivial. To address this issue, we also propose an order-agnostic training method. Our experiments show that our joint model achieves results similar to those of task-specific classifiers and can effectively integrate dialogue context to further improve SLU performance.
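The abstract does not specify how the order-agnostic training is implemented. As a minimal sketch of one plausible reading, the target sequence for each training utterance could list the four attributes in a freshly sampled random order, with a tag token marking each attribute so that an autoregressive decoder is not tied to one fixed prediction order. The names `ATTRIBUTES`, `build_target`, and `parse_prediction`, the tag-token format, and the example labels are all assumptions made for illustration; they are not taken from the paper.

```python
import random

# Hypothetical attribute names, based on the four tasks listed in the abstract.
ATTRIBUTES = ("intent", "dialog_act", "speaker_role", "emotion")


def build_target(labels, rng, shuffle=True):
    """Return a flat target token sequence for one utterance.

    Each attribute contributes a tag token followed by its label, so the
    attribute being predicted is identifiable regardless of its position.
    """
    order = list(ATTRIBUTES)
    if shuffle:
        rng.shuffle(order)  # sample a new attribute order for this example
    target = []
    for name in order:
        target.append(f"<{name}>")
        target.append(labels[name])
    return target


def parse_prediction(tokens):
    """Recover an attribute -> label mapping from a sequence in any order."""
    return {tag[1:-1]: label for tag, label in zip(tokens[0::2], tokens[1::2])}


# Illustrative labels only; not drawn from any dataset used in the paper.
labels = {
    "intent": "book_flight",
    "dialog_act": "request",
    "speaker_role": "customer",
    "emotion": "neutral",
}
rng = random.Random(0)
sequence = build_target(labels, rng)
# The mapping is recoverable whatever order was sampled.
assert parse_prediction(sequence) == labels
```

Under this assumption, the decoder sees many different attribute orders during training, and the tag tokens let predictions be read back into a fixed attribute-to-label mapping at inference time.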