Training task-oriented dialog agents with reinforcement learning is time-consuming and requires a large number of interactions with real users. Learning an effective dialog policy from limited dialog experience remains an obstacle that makes agent training inefficient. In addition, most previous frameworks start training by choosing training samples at random, which differs from how humans learn and hurts both the efficiency and the stability of training. Therefore, we propose Scheduled Curiosity-Deep Dyna-Q (SC-DDQ), a curiosity-driven curriculum learning framework built on Deep Dyna-Q (DDQ), a state-of-the-art model-based reinforcement learning dialog model. Furthermore, we design learning schedules for SC-DDQ and DDQ following two opposite training strategies: classic curriculum learning (easy-first) and its reverse (difficult-first). Our results show that introducing scheduled learning and curiosity yields a significant improvement over DDQ and Deep Q-learning (DQN). Surprisingly, we find that traditional curriculum learning is not always effective: according to the experimental results, the easy-first and difficult-first strategies are more suitable for SC-DDQ and DDQ, respectively. To analyze our results, we adopt the entropy of sampled actions to characterize action exploration and find that training strategies with high entropy in the first stage and low entropy in the last stage lead to better performance.
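For concreteness, the action-entropy measure referred to above can be read as the Shannon entropy of the empirical distribution of sampled actions; this is the standard definition, and the exact formulation used in the experiments is an assumption here:

\[
H = -\sum_{a \in \mathcal{A}} p(a) \log p(a),
\]

where \(\mathcal{A}\) is the dialog action set and \(p(a)\) is the empirical frequency with which the agent samples action \(a\) during a training stage; higher \(H\) indicates broader action exploration.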