Dialogue policy learning (DPL) is a crucial component of dialogue modelling. Its primary role is to determine the appropriate abstract response, commonly referred to as the "dialogue action". Traditional DPL methods treat this as a sequential decision problem over pre-defined action candidates extracted from a corpus. Because these candidate sets are incomplete, they can significantly limit response diversity and make edge cases (scenarios that occur only at extreme operating parameters) difficult to handle. To address these limitations, we introduce JoTR, a novel framework that leverages a text-to-text Transformer-based model to generate flexible dialogue actions. Unlike traditional methods, JoTR formulates a word-level policy that generates dialogue actions dynamically and adaptively, without any action templates. This setting enhances response diversity and improves the system's ability to handle edge cases. In addition, JoTR employs reinforcement learning with a reward-shaping mechanism to efficiently fine-tune the word-level dialogue policy, allowing the model to learn from its interactions and improve over time. Our extensive evaluation shows that JoTR achieves state-of-the-art performance on two benchmark dialogue modelling tasks, as assessed by both user simulators and human evaluators.
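The abstract does not say how JoTR's reward-shaping mechanism is defined, so the following is only a minimal sketch of the general idea, using standard potential-based shaping as a stand-in rather than the authors' method. It shows how a sparse end-of-turn reward for a word-level policy can be redistributed over the generated tokens of a dialogue action. The token sequence, the potential values, and the function names `shaped_rewards` and `discounted_returns` are all hypothetical.

```python
from typing import List


def shaped_rewards(rewards: List[float], potentials: List[float],
                   gamma: float = 1.0) -> List[float]:
    """Potential-based shaping: r'_t = r_t + gamma * phi(s_{t+1}) - phi(s_t).

    `potentials` holds phi(s_0) .. phi(s_T), so it has one more entry
    than `rewards`. This is an assumed stand-in, not JoTR's mechanism.
    """
    if len(potentials) != len(rewards) + 1:
        raise ValueError("potentials must have len(rewards) + 1 entries")
    return [r + gamma * potentials[t + 1] - potentials[t]
            for t, r in enumerate(rewards)]


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go for each step, as used in a REINFORCE-style update."""
    running = 0.0
    out = []
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


# Hypothetical word-level dialogue action, generated token by token
# instead of being picked from a fixed candidate list.
action_tokens = ["request", "(", "area", ")"]

# Sparse reward: only the final token receives the task-success signal.
sparse = [0.0, 0.0, 0.0, 1.0]

# Hypothetical potential: rises once the slot name has been generated,
# and is zero at the start and at the terminal state.
phi = [0.0, 0.0, 0.5, 0.5, 0.0]

dense = shaped_rewards(sparse, phi, gamma=1.0)
returns = discounted_returns(dense, gamma=1.0)
```

With `gamma=1.0` and a potential that is zero at both ends, the shaped rewards sum to the same total as the sparse ones, so the shaping changes when credit arrives, not how much of it there is.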