Reinforcement-learning-based dialogue policies are typically trained through interaction with a user simulator. To yield an effective and robust policy, this simulator should generate user behaviour that is both realistic and varied. Current data-driven simulators are trained to accurately model the user behaviour in a dialogue corpus. We propose an alternative method using adversarial learning, with the aim of simulating realistic user behaviour with more variation. We train and evaluate several simulators on a corpus of restaurant search dialogues, and then use them to train dialogue system policies. In policy cross-evaluation experiments, we demonstrate that an adversarially trained simulator produces policies with an 8.3% higher success rate than those trained with a maximum likelihood simulator. Subjective results from a crowd-sourced dialogue system user evaluation confirm the effectiveness of adversarially training user simulators.