We study a class of reinforcement learning problems in which the reward signals for policy learning are generated by a discriminator that depends on, and is jointly optimized with, the policy. This interdependence makes learning unstable: reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an under-optimized policy impedes discriminator learning. We call this setting \textit{Internally Rewarded Reinforcement Learning} (IRRL), as the reward is not provided directly by the environment but \textit{internally} by the discriminator. In this paper, we formalize IRRL and present a class of problems that belong to it. We theoretically derive and empirically analyze the effect of the reward function in IRRL and, based on these analyses, propose the clipped linear reward function. Experimental results show that the proposed reward function consistently stabilizes training by reducing the impact of reward noise, leading to faster convergence and higher performance than baselines across diverse tasks.