Deep neural networks provide Reinforcement Learning (RL) with powerful function approximators to address large-scale decision-making problems. However, these approximators introduce challenges due to the non-stationary nature of RL training. One source of these challenges is churn in output predictions: after each batch update, predictions for states not included in the batch change in an uncontrolled way. Although churn occurs at every step of network training, how it arises and how it impacts RL remain under-explored. In this work, we start by characterizing churn from the view of Generalized Policy Iteration with function approximation, and we discover a chain effect of churn: the churn in value estimation and the churn in policy improvement compound in a cycle, biasing the learning dynamics throughout the iteration. We then concretize the study, focusing on the learning issues the chain effect causes in different settings, including greedy-action deviation in value-based methods, trust-region violation in proximal policy optimization, and a dual bias of policy value in actor-critic methods. To reduce the chain effect across these settings, we propose Churn Approximated ReductIoN (CHAIN), a method that can be easily plugged into most existing DRL algorithms. Our experiments demonstrate its effectiveness in both reducing churn and improving learning performance across online and offline, value-based and policy-based RL settings, as well as a scaling setting.
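The churn phenomenon described above can be illustrated with a minimal toy experiment (a hypothetical setup, not the paper's CHAIN method): update a small shared-parameter network on one batch of states and measure how its predictions move on held-out states that were not in the batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer network standing in for a Q-value approximator
# over 1-D states. All parameters are shared across states, which is
# exactly why updating on one batch perturbs predictions elsewhere.
W1 = rng.normal(size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1)); b2 = np.zeros(1)

def predict(x):
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

batch_x = rng.uniform(-1, 1, size=(8, 1))    # states in the update batch
batch_y = rng.uniform(-1, 1, size=(8, 1))    # regression targets for them
held_out = rng.uniform(-1, 1, size=(32, 1))  # states NOT in the batch

before = predict(held_out)

# One SGD step of 0.5 * MSE on the batch (manual backprop).
lr = 0.1
h = np.tanh(batch_x @ W1 + b1)
err = (h @ W2 + b2) - batch_y           # dLoss/dprediction
dW2 = h.T @ err / len(batch_x)
db2 = err.mean(axis=0)
dh = (err @ W2.T) * (1 - h ** 2)        # backprop through tanh
dW1 = batch_x.T @ dh / len(batch_x)
db1 = dh.mean(axis=0)
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2

after = predict(held_out)
churn = np.abs(after - before).mean()
# Nonzero: the update on the batch moved predictions on unseen states.
print(f"mean |prediction change| on held-out states: {churn:.4f}")
```

With a tabular representation each state's value would be independent, so held-out predictions would be untouched; the nonzero change here is the churn that the chain effect compounds over many updates.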