The ability to learn off-policy is an important feature of reinforcement learning (RL) for practical applications. However, even one of the most elementary RL algorithms, temporal-difference (TD) learning, is known to diverge when off-policy learning is combined with linear function approximation. To overcome this divergence, several off-policy TD-learning algorithms have been developed, including gradient-TD learning (GTD) and TD-learning with correction (TDC). In this work, we provide a unified view of such algorithms from a purely control-theoretic perspective and propose a new convergent algorithm. Our method relies on the backstepping technique, which is widely used in nonlinear control theory. Finally, convergence of the proposed algorithm is experimentally verified in environments where standard TD-learning is known to be unstable.
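To make the divergence issue concrete, the following is a minimal sketch of the well-known two-state "w, 2w" example, in which semi-gradient TD(0) with linear function approximation is updated only on the transition from the first state to the second. It is not the algorithm proposed in this work, nor one of its experiments. The function name `td0_two_state` and all parameter values are hypothetical choices made for illustration.

```python
def td0_two_state(w0=1.0, alpha=0.1, gamma=0.99, steps=100):
    """Semi-gradient TD(0), updated only on the transition state 1 -> state 2.

    Assumption (illustrative): the two states have scalar features 1 and 2,
    so their estimated values are w and 2w, and the reward is always 0.
    """
    phi_s, phi_next = 1.0, 2.0  # linear features of the two states
    reward = 0.0
    w = w0
    history = [w]
    for _ in range(steps):
        # TD error for the single transition that the behavior policy samples
        td_error = reward + gamma * phi_next * w - phi_s * w
        # Semi-gradient update: the gradient of the value estimate is phi_s
        w += alpha * td_error * phi_s
        history.append(w)
    return history


if __name__ == "__main__":
    diverging = td0_two_state(gamma=0.99)  # per-step factor 1 + alpha*(2*gamma - 1) > 1
    decaying = td0_two_state(gamma=0.4)    # per-step factor < 1
    print("gamma=0.99 final |w|:", abs(diverging[-1]))
    print("gamma=0.40 final |w|:", abs(decaying[-1]))
```

Each update multiplies w by 1 + alpha * (2 * gamma - 1), so in this sketch the weight grows without bound whenever gamma > 0.5, even though the true value of both states is zero. Gradient-based variants such as GTD and TDC are designed to avoid this kind of instability.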