Reinforcement learning (RL) is a powerful approach for enhancing task-oriented dialogue (TOD) systems. However, existing RL methods focus mainly on generation tasks, such as dialogue policy learning (DPL) or response generation (RG), while neglecting dialogue state tracking (DST), the understanding task. This narrow focus overlooks the interdependence between understanding and generation and thus prevents systems from achieving globally optimal performance. Additionally, RL methods face sparse and delayed rewards, which complicate training and optimization. To address these issues, we extend RL to both understanding and generation tasks by introducing step-by-step rewards throughout token generation. The understanding reward increases as more slots are correctly filled in DST, while the generation reward grows as user requests are accurately included. Our approach provides a balanced optimization aligned with task completion. Experimental results demonstrate that our approach effectively enhances the performance of TOD systems and achieves new state-of-the-art results on three widely used datasets: MultiWOZ2.0, MultiWOZ2.1, and In-Car. Our approach also shows superior few-shot ability in low-resource settings compared to current models.
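The step-by-step rewards described in the abstract can be illustrated with a minimal sketch. The abstract does not give the reward formulas, so everything here is an assumption made for illustration: the function names `understanding_rewards` and `generation_rewards` are hypothetical, each reward is taken to be a simple fraction, and user requests are assumed to appear in the response as placeholder tokens such as `[phone]`.

```python
from typing import Dict, List, Sequence


def understanding_rewards(
    partial_states: Sequence[Dict[str, str]],
    gold_state: Dict[str, str],
) -> List[float]:
    """Assumed form: reward at each step is the fraction of gold slots
    that are filled with the correct value in the state decoded so far."""
    if not gold_state:
        return [0.0 for _ in partial_states]
    rewards = []
    for state in partial_states:
        correct = sum(
            1 for slot, value in gold_state.items() if state.get(slot) == value
        )
        rewards.append(correct / len(gold_state))
    return rewards


def generation_rewards(
    response_tokens: Sequence[str],
    requested_slots: Sequence[str],
) -> List[float]:
    """Assumed form: reward at each token is the fraction of requested
    slots already mentioned in the response prefix up to that token."""
    requested = set(requested_slots)
    if not requested:
        return [0.0 for _ in response_tokens]
    seen = set()
    rewards = []
    for token in response_tokens:
        if token in requested:
            seen.add(token)
        rewards.append(len(seen) / len(requested))
    return rewards


# Illustrative inputs (made up for this sketch, not taken from any dataset).
gold = {"hotel-area": "north", "hotel-stars": "4"}
partial = [{}, {"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "4"}]
dst_rewards = understanding_rewards(partial, gold)

tokens = ["the", "[phone]", "is", "at", "[address]"]
rg_rewards = generation_rewards(tokens, ["[phone]", "[address]"])
```

In this sketch both reward sequences are non-decreasing and reach 1.0 only when every gold slot or requested item is covered, which gives a dense signal at intermediate steps instead of a single reward at the end of the dialogue.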