Because the natural-language action space is vast, approximate dynamic programming for dialogue generation traditionally performs greedy policy improvement over sampled actions. However, this practice is inefficient for reinforcement learning (RL): eligible responses with high action values are sparse, so random sampling yields only weak improvement. Our theoretical analysis and experiments show that the performance of the dialogue policy is positively correlated with the sampling size. To overcome this limitation, we introduce a novel dual-granularity Q-function that explores the most promising response category and uses it to intervene in the sampling process. Our approach extracts actions along a hierarchy of granularities, thereby reaching the optimum in fewer policy iterations. Additionally, we use offline RL and learn from multiple reward functions designed to capture emotional nuances in human interactions. Empirical studies demonstrate that our algorithm outperforms baselines on both automatic metrics and human evaluations. Further testing shows that our algorithm is both explainable and controllable and generates responses with higher expected rewards.
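The contrast between flat action sampling and category-guided sampling can be illustrated with a minimal toy sketch. This is not the paper's implementation: the action space `TOY_Q`, the category names, the Q-values, and every function name are hypothetical. It also assumes that the coarse-grained Q-value of a category is the mean of its responses' fine-grained values and that the category is chosen by a plain argmax, neither of which the abstract specifies.

```python
import random
from statistics import mean

# Hypothetical toy action space (an assumption for illustration only):
# response category -> {response text: fine-grained Q-value}.
TOY_Q = {
    "dismissive": {"Whatever.": 0.00, "Not my problem.": 0.05, "Okay.": 0.10},
    "neutral": {"I see.": 0.10, "Go on.": 0.20, "Tell me more.": 0.30},
    "empathetic": {"That sounds hard.": 0.70, "I understand.": 0.80, "I'm here for you.": 0.90},
}


def q_fine(action):
    """Fine-grained Q-value of a single (category, response) action."""
    category, response = action
    return TOY_Q[category][response]


def q_coarse(category):
    """Coarse-grained Q-value of a category (assumed here: mean of its responses)."""
    return mean(TOY_Q[category].values())


def sample_in_category(category, rng):
    """Draw one response uniformly from a given category."""
    return category, rng.choice(sorted(TOY_Q[category]))


def sample_flat(rng):
    """Draw one response uniformly from the whole action space."""
    return sample_in_category(rng.choice(sorted(TOY_Q)), rng)


def flat_improvement(n, rng):
    """Greedy improvement over n actions sampled from the whole space."""
    candidates = [sample_flat(rng) for _ in range(n)]
    return max(candidates, key=q_fine)


def dual_granularity_improvement(n, rng):
    """Pick the best category first, then act greedily over n samples inside it."""
    best_category = max(TOY_Q, key=q_coarse)
    candidates = [sample_in_category(best_category, rng) for _ in range(n)]
    return max(candidates, key=q_fine)


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (1, 2, 5):
        print(n, flat_improvement(n, rng), dual_granularity_improvement(n, rng))
```

In this toy setting the flat sampler needs a larger `n` to have a good chance of drawing a high-value response at all, whereas the category-guided sampler only ever draws from the highest-scoring category, so even `n = 1` returns a response from it.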