Because the natural language action space is astronomically large, approximate dynamic programming for dialogue generation conventionally performs policy improvement over sampled actions. This practice is inefficient for reinforcement learning (RL): eligible responses (those with high action values) are very sparse, so the greedy policy sustained by random sampling is weak. We show, both theoretically and experimentally, that the performance of a dialogue policy is positively correlated with the sampling size. To alleviate this limitation, we introduce a novel dual-granularity Q-function that identifies the most promising response category and uses it to intervene in the sampling. Actions are extracted by following this granularity hierarchy, which allows the policy to reach the optimum in fewer policy iterations. Our approach learns via offline RL from multiple reward functions designed to recognize human emotional details. Empirical studies demonstrate that our algorithm outperforms the baseline methods, and further verification shows that it generates responses with higher expected rewards and greater controllability.
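The two ideas in the abstract, that a sampled greedy policy improves with sampling size and that choosing a response category first makes sampling more efficient, can be illustrated with a toy sketch. This is not the paper's implementation: the action space, the Gaussian action values, the use of a per-category mean as the coarse Q-function, and all function names below are assumptions made purely for illustration.

```python
import random


def greedy_from_samples(candidates, q_fine, n_samples, rng):
    """Sampling-based policy improvement: return the best of n uniformly
    sampled candidate actions according to the fine-grained Q-function."""
    sampled = [rng.choice(candidates) for _ in range(n_samples)]
    return max(sampled, key=q_fine)


def dual_granularity_greedy(categories, q_coarse, q_fine, n_samples, rng):
    """Pick the most promising category with the coarse Q-function,
    then sample and rank actions only inside that category."""
    best_category = max(categories, key=q_coarse)
    return greedy_from_samples(categories[best_category], q_fine, n_samples, rng)


def build_toy_problem(n_categories=10, actions_per_category=100, seed=0):
    """Hypothetical toy problem: an action is a (category, index) pair and
    its value is Gaussian noise around the category id (an assumption)."""
    rng = random.Random(seed)
    categories = {
        c: [(c, i) for i in range(actions_per_category)]
        for c in range(n_categories)
    }
    fine = {
        a: rng.gauss(float(c), 1.0)
        for c, actions in categories.items()
        for a in actions
    }
    # Assumed stand-in for a learned coarse Q-function: the category mean.
    coarse = {
        c: sum(fine[a] for a in actions) / len(actions)
        for c, actions in categories.items()
    }
    return categories, coarse.__getitem__, fine.__getitem__


def mean_selected_value(select, q_fine, trials=200, seed=1):
    """Average fine-grained value of the action chosen by `select`."""
    rng = random.Random(seed)
    return sum(q_fine(select(rng)) for _ in range(trials)) / trials


categories, q_coarse, q_fine = build_toy_problem()
all_actions = [a for actions in categories.values() for a in actions]

uniform_5 = mean_selected_value(
    lambda rng: greedy_from_samples(all_actions, q_fine, 5, rng), q_fine)
uniform_50 = mean_selected_value(
    lambda rng: greedy_from_samples(all_actions, q_fine, 50, rng), q_fine)
dual_5 = mean_selected_value(
    lambda rng: dual_granularity_greedy(categories, q_coarse, q_fine, 5, rng),
    q_fine)

print(f"uniform sampling, n=5:   {uniform_5:.2f}")
print(f"uniform sampling, n=50:  {uniform_50:.2f}")
print(f"category-guided, n=5:    {dual_5:.2f}")
```

In this toy setting, a larger sample raises the value of the greedy choice, and restricting the sample to the category ranked highest by the coarse function reaches a comparable value with far fewer samples. The paper's actual Q-functions are learned from data, so the sketch shows only the sampling logic.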