Training a dialogue policy with deep reinforcement learning requires extensive exploration of the environment, and much of that exploration is invalid and wasted, which makes learning inefficient. In this paper, we identify and define an important cause of invalid exploration: dead-ends. Once a conversation enters a dead-end state, it continues along a dead-end trajectory regardless of the actions taken afterward, until the agent reaches a termination state or the maximum number of turns. We propose a dead-end resurrection (DDR) algorithm that detects the initial dead-end state promptly and efficiently and provides a rescue action to guide and correct the direction of exploration. To prevent dialogue policies from repeating the same mistake, DDR also performs dialogue data augmentation by adding relevant experiences that contain dead-end states. We first validate the reliability of dead-end detection and then demonstrate the effectiveness and generality of the method by reporting experimental results on several dialogue datasets from different domains.
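The detect, rescue, and augment loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy state graph `TRANSITIONS`, the reachability-based check `is_dead_end`, the helper `rescue_action`, the penalty of -1.0, and the replay-buffer tuple layout are all assumptions chosen only to make the idea concrete. A real dialogue system has no explicit transition table, so its dead-end detector would have to work from the dialogue state instead.

```python
from collections import deque

# Hypothetical toy environment: TRANSITIONS[state][action] -> next state.
# State 3 is the goal; states 4 and 5 form a trap with no way back.
TRANSITIONS = {
    0: {"a": 1, "b": 4},
    1: {"a": 2, "b": 5},
    2: {"a": 3, "b": 4},
    3: {},
    4: {"a": 5, "b": 4},
    5: {"a": 4, "b": 5},
}
GOAL = 3
MAX_TURNS = 10


def is_dead_end(state):
    """Assumed detector: a state is a dead-end if the goal is unreachable (BFS)."""
    seen, queue = {state}, deque([state])
    while queue:
        current = queue.popleft()
        if current == GOAL:
            return False
        for nxt in TRANSITIONS[current].values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def rescue_action(state):
    """Return an action whose successor is not a dead-end, or None if none exists."""
    for action, nxt in sorted(TRANSITIONS[state].items()):
        if not is_dead_end(nxt):
            return action
    return None


def run_episode(policy, replay_buffer):
    """Run one episode; experiences are (state, action, reward, next_state, done)."""
    state = 0
    for _ in range(MAX_TURNS):
        if state == GOAL:
            return True
        action = policy(state)
        nxt = TRANSITIONS[state][action]
        if is_dead_end(nxt):
            # Augmentation: store the mistake as a penalized terminal experience.
            replay_buffer.append((state, action, -1.0, nxt, True))
            # Rescue: replace the bad action with one that keeps the goal reachable.
            action = rescue_action(state)
            if action is None:
                return False
            nxt = TRANSITIONS[state][action]
        reward = 1.0 if nxt == GOAL else 0.0
        replay_buffer.append((state, action, reward, nxt, nxt == GOAL))
        state = nxt
    return state == GOAL


# Usage: a policy that always picks the trap action "b" still succeeds once rescued.
buffer = []
success = run_episode(lambda state: "b", buffer)
```

In this sketch the first dead-end state on a trajectory is caught before the agent commits to it, and every rescued step leaves two experiences in the buffer: the penalized mistake and the corrected transition.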