Multi-Action Dialog Policy Learning from Logged User Feedback

Multi-action dialog policies, which generate multiple atomic dialog actions per turn, have been widely applied in task-oriented dialog systems to provide expressive and efficient system responses. Existing policy models usually imitate action combinations from labeled multi-action dialog examples. Because such data is limited, they generalize poorly to unseen dialog flows. Reinforcement learning-based methods have been proposed to incorporate service ratings from real users and user simulators as external supervision signals, but they suffer from sparse and less credible dialog-level rewards. To cope with this problem, we explore improving multi-action dialog policy learning with explicit and implicit turn-level user feedback received for historical predictions (i.e., logged user feedback), which is cost-efficient to collect and faithful to real-world scenarios. The task is challenging because logged user feedback provides only partial label feedback, limited to the particular historical dialog actions predicted by the agent. To fully exploit such feedback, we propose BanditMatch, which addresses the task from a feedback-enhanced semi-supervised learning perspective with a hybrid objective of semi-supervised learning and bandit learning. BanditMatch integrates pseudo-labeling methods to better explore the action space by constructing full label feedback. Extensive experiments show that BanditMatch outperforms state-of-the-art methods by generating more concise and informative responses. The source code and the appendix of this paper can be obtained from https://github.com/ShuoZhangXJTU/BanditMatch.
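
To make the idea of a hybrid objective more concrete, the following is a minimal sketch of how a supervised term, a bandit term on logged feedback, and a pseudo-labeling term might be combined. It is not the authors' implementation: the independent per-action Bernoulli outputs, the binary approve/reject feedback signal, the confidence threshold `tau`, the weights `lam_b` and `lam_u`, and all function names are assumptions made for illustration only.

```python
import math

EPS = 1e-12  # numerical guard for log(0); an illustrative choice


def bce(probs, labels, mask=None):
    """Mean binary cross-entropy over the atomic actions selected by mask."""
    if mask is None:
        mask = [1] * len(probs)
    total = 0.0
    for p, y, m in zip(probs, labels, mask):
        if m:
            total -= y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS)
    n = sum(mask)
    return total / n if n else 0.0


def combo_log_prob(probs, actions):
    """Log-probability of an action combination, assuming independent actions."""
    return sum(math.log((p if a else 1 - p) + EPS) for p, a in zip(probs, actions))


def bandit_loss(probs, logged_actions, feedback):
    """Partial-label term: feedback is known only for the logged combination.

    Assumed convention: feedback == 1 means the user approved the logged
    actions, feedback == 0 means the user rejected them.
    """
    lp = combo_log_prob(probs, logged_actions)
    if feedback == 1:
        return -lp  # raise the probability of the approved combination
    return -math.log(1 - math.exp(lp) + EPS)  # lower it for a rejected one


def pseudo_label_loss(probs_weak, probs_strong, tau=0.9):
    """Pseudo-labeling term: train on confident predictions only."""
    pseudo = [1 if p >= 0.5 else 0 for p in probs_weak]
    mask = [1 if max(p, 1 - p) >= tau else 0 for p in probs_weak]
    return bce(probs_strong, pseudo, mask)


def hybrid_loss(sup, bandit, unlabeled, lam_b=1.0, lam_u=1.0):
    """Weighted sum of the three terms; each argument is a tuple of inputs."""
    return (bce(*sup)
            + lam_b * bandit_loss(*bandit)
            + lam_u * pseudo_label_loss(*unlabeled))


# Toy usage with three atomic actions (all numbers are made up).
loss = hybrid_loss(
    sup=([0.8, 0.2, 0.6], [1, 0, 1]),
    bandit=([0.7, 0.3, 0.4], [1, 0, 0], 1),
    unlabeled=([0.95, 0.05, 0.6], [0.9, 0.2, 0.5]),
)
print(round(loss, 4))
```

In this sketch the bandit term only ever touches the action combination that was actually logged, which mirrors the partial-label limitation described above, while the pseudo-labeling term supplies a full label vector for turns whose predictions are confident enough.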