Dialog policies, which determine a system's action at each dialog turn based on the current state, are crucial to the success of a dialog. In recent years, reinforcement learning (RL) has emerged as a promising approach to dialog policy learning (DPL). In RL-based DPL, dialog policies are updated according to rewards. However, manually constructing fine-grained rewards, such as state-action-based ones, that effectively guide the dialog policy is challenging in multi-domain task-oriented dialog scenarios, where the number of state-action pair combinations is large. One way to estimate rewards from collected data is to train the reward estimator and the dialog policy simultaneously using adversarial learning (AL). Although this method has demonstrated superior performance experimentally, it suffers from the inherent problems of AL, such as mode collapse. This paper first identifies the role of AL in DPL through detailed analyses of the objective functions of the dialog policy and the reward estimator. Next, based on these analyses, we propose a method that eliminates AL from reward estimation and DPL while retaining its advantages. We evaluate our method using MultiWOZ, a multi-domain task-oriented dialog corpus.