End-to-end multi-task dialogue systems are usually designed with separate modules for the dialogue pipeline. Among these, the policy module is essential for deciding what to do in response to user input. This policy is trained with reinforcement learning algorithms, using an environment in which an agent receives feedback in the form of a reward signal. Current dialogue systems, however, provide only sparse and simplistic rewards. This study investigates intrinsic motivation reinforcement learning algorithms. By learning an internal incentive system, the agent can accelerate training and improve its capacity to judge the quality of its actions. In particular, we adapt random network distillation and curiosity-driven reinforcement learning to measure the frequency of state visits and to encourage exploration using semantic similarity between utterances. Experimental results on MultiWOZ, a heterogeneous dataset, show that intrinsic motivation-based dialogue systems outperform policies that depend on extrinsic incentives. For example, random network distillation trained using semantic similarity between user-system dialogues achieves an average success rate of 73%, a significant improvement over the baseline Proximal Policy Optimization (PPO), which has an average success rate of 60%. In addition, performance indicators such as booking rates and completion rates show a 10% rise over the baseline. Furthermore, these intrinsic incentive models improve the robustness of the system's policy as the number of domains increases, which suggests that they could be useful in scaling up to settings that cover a wider range of domains.
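To make the random network distillation idea concrete, the following is a minimal sketch and not the implementation evaluated in this study. It assumes each dialogue state is already encoded as a fixed-size, unit-norm vector (for example an utterance embedding); the class name `RNDBonus`, the single-layer networks, the learning rate, and the weight `beta` are all hypothetical choices made for illustration. A fixed random target network is imitated by a trainable predictor, and the predictor's error on a state serves as the intrinsic reward, so rarely visited states earn a larger bonus.

```python
import numpy as np


class RNDBonus:
    """Illustrative random network distillation bonus (hypothetical helper)."""

    def __init__(self, dim, feat_dim=16, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed, randomly initialised target network; it is never trained.
        self.w_target = rng.normal(size=(dim, feat_dim))
        # Trainable predictor, a single linear layer to keep the sketch short.
        self.w_pred = np.zeros((dim, feat_dim))
        self.lr = lr

    def _target(self, x):
        return np.tanh(x @ self.w_target)

    def reward(self, x):
        """Intrinsic reward: mean squared prediction error on state x."""
        err = x @ self.w_pred - self._target(x)
        return float(np.mean(err ** 2))

    def update(self, x):
        """One gradient step moving the predictor towards the target on x."""
        err = x @ self.w_pred - self._target(x)
        grad = 2.0 * np.outer(x, err) / err.size
        self.w_pred -= self.lr * grad


def total_reward(extrinsic, intrinsic, beta=0.1):
    """Combine rewards; beta is an assumed weighting, not a reported value."""
    return extrinsic + beta * intrinsic


def unit(v):
    """Scale a vector to unit norm, as the sketch assumes for all states."""
    return v / np.linalg.norm(v)


rng = np.random.default_rng(1)
familiar = unit(rng.normal(size=32))  # stand-in for a frequently seen state
novel = unit(rng.normal(size=32))     # stand-in for an unseen state

bonus = RNDBonus(dim=32)
before = bonus.reward(familiar)
for _ in range(200):                  # simulate repeated visits
    bonus.update(familiar)
after = bonus.reward(familiar)

print(before > after)                 # bonus shrinks for the visited state
print(bonus.reward(novel) > after)    # unseen state keeps a larger bonus
```

In this sketch the bonus acts as a rough proxy for visit frequency: the more often a state is used to update the predictor, the smaller its prediction error becomes. How the semantic similarity between utterances enters the training signal is not specified here, so the random vectors above only stand in for real embeddings.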