Reinforcement Learning (RL) is widely used for decision-making problems and rests on two fundamental operations: policy evaluation and policy improvement. Improving learning efficiency remains a key challenge in RL, and many efforts use ensemble critics to make policy evaluation more efficient. However, with multiple critics, the actor can receive a different gradient from each critic during policy improvement. Previous studies have combined these gradients without considering their disagreement, which leaves the policy improvement step itself as an opportunity to improve learning efficiency. This study investigates how gradient disagreements caused by ensemble critics affect policy improvement. We introduce the uncertainty of gradient directions as a measure of the disagreement among the gradients used in policy improvement. Using this measure, we find that transitions with lower uncertainty of gradient directions are more reliable in the policy improvement process. Building on this analysis, we propose von Mises-Fisher Experience Resampling (vMFER), which optimizes policy improvement by resampling transitions and assigning higher confidence to transitions with lower uncertainty of gradient directions. Our experiments demonstrate that vMFER significantly outperforms the baseline methods and is particularly well-suited for ensemble structures in RL.
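The resampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact uncertainty estimator or the resampling rule, so the code below assumes that (a) each critic's actor gradient for a transition is normalized to a unit vector, (b) the von Mises-Fisher concentration is estimated from the mean resultant length with a common closed-form approximation, and (c) transitions are resampled with rank-based probabilities. The function names `mean_resultant_length`, `vmf_concentration`, and `resampling_probs` are hypothetical.

```python
import numpy as np


def mean_resultant_length(grads):
    """Length of the average unit gradient for one transition.

    grads: array of shape (n_critics, dim), one actor gradient per critic.
    Values near 1 mean the critics agree on the direction; values near 0
    mean they disagree.
    """
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    units = grads / np.maximum(norms, 1e-12)  # guard against zero gradients
    return float(np.linalg.norm(units.mean(axis=0)))


def vmf_concentration(r_bar, dim):
    """Approximate vMF concentration (kappa) from the mean resultant length.

    Uses the closed-form approximation of Banerjee et al. (2005):
    kappa ~= r_bar * (dim - r_bar**2) / (1 - r_bar**2).
    Higher kappa means lower uncertainty of the gradient direction.
    """
    r_bar = min(r_bar, 1.0 - 1e-6)  # avoid division by zero at r_bar == 1
    return r_bar * (dim - r_bar ** 2) / (1.0 - r_bar ** 2)


def resampling_probs(per_transition_grads):
    """Rank-based resampling probabilities (an assumed scheme).

    per_transition_grads: array of shape (batch, n_critics, dim).
    The transition with the highest kappa gets rank 1 and the largest
    probability; probabilities are proportional to 1 / rank.
    """
    dim = per_transition_grads.shape[-1]
    kappas = np.array([
        vmf_concentration(mean_resultant_length(g), dim)
        for g in per_transition_grads
    ])
    ranks = kappas.argsort()[::-1].argsort() + 1  # rank 1 = highest kappa
    weights = 1.0 / ranks
    return weights / weights.sum()


# Toy data: transition 0 has agreeing critics, transition 1 has opposing ones.
demo_grads = np.array([
    [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]],
])
probs = resampling_probs(demo_grads)

# Draw a resampled batch of transition indices for the actor update.
rng = np.random.default_rng(0)
batch_idx = rng.choice(len(probs), size=4, p=probs)
```

In this toy example the transition whose critics agree receives the larger resampling probability, so it is drawn more often for the policy improvement step. Any monotone mapping from concentration to probability would serve the same illustrative purpose; the rank-based choice here only keeps the weights insensitive to the scale of kappa.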