Task-oriented dialogue (TOD) systems aim to achieve specific goals through interactive dialogue. Such tasks usually involve following specific workflows, i.e. executing a sequence of actions in a particular order. While prior work has focused on supervised learning methods that condition on past actions, these methods do not explicitly optimize for compliance with a desired workflow. In this paper, we propose a novel framework based on reinforcement learning (RL) to generate dialogue responses that are aligned with a given workflow. Our framework consists of ComplianceScorer, a metric designed to evaluate how well a generated response executes the specified action, combined with an RL optimization process that utilizes an interactive sampling technique. We evaluate our approach on two TOD datasets, the Action-Based Conversations Dataset (ABCD) (Chen et al., 2021a) and MultiWOZ 2.2 (Zang et al., 2020), on a range of automated and human evaluation metrics. Our findings indicate that our RL-based framework outperforms baselines and is effective at generating responses that both comply with the intended workflows and are expressed in a natural and fluent manner.