Long-horizon tasks, which require a discount factor close to one, pose a challenge for most conventional reinforcement learning (RL) algorithms. Algorithms such as Value Iteration and Temporal Difference (TD) learning converge slowly and become inefficient in these tasks. When the transition distributions are given, PID Value Iteration (PID VI) was recently introduced to accelerate the convergence of Value Iteration using ideas from control theory. Inspired by this, we introduce the PID TD Learning and PID Q-Learning algorithms for the RL setting, in which only samples from the environment are available. We give a theoretical analysis of the convergence of PID TD Learning and of its acceleration over conventional TD Learning. We also introduce a method for adapting the PID gains in the presence of noise and empirically verify its effectiveness.
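To make the idea concrete, the abstract's PID TD update can be sketched by augmenting the usual TD(0) step (proportional term = TD error) with an integral accumulator over past TD errors and a derivative term over successive value estimates. This is a minimal illustrative sketch, not the paper's exact algorithm: the gain names `kp`, `ki`, `kd`, the accumulator recursion with `alpha`/`beta`, and the tabular setting are all assumptions made here for illustration.

```python
import numpy as np

def pid_td_sweep(V, V_prev, z, transitions,
                 kp=1.0, ki=0.05, kd=0.1,
                 alpha=0.05, beta=0.95, lr=0.1, gamma=0.99):
    """One pass of a PID-style TD(0) update over sampled transitions.

    V, V_prev : current and previous tabular value estimates (NumPy arrays).
    z         : per-state integral accumulator of past TD errors.
    transitions : list of (s, r, s_next) samples from a fixed policy.

    Gains kp/ki/kd and the accumulator recursion are illustrative;
    setting kp=1, ki=0, kd=0 recovers ordinary TD(0).
    """
    V_new = V.copy()
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]   # TD error (proportional term)
        z[s] = beta * z[s] + alpha * delta     # discounted running sum (integral term)
        deriv = V[s] - V_prev[s]               # change in the estimate (derivative term)
        V_new[s] = V[s] + lr * (kp * delta + ki * z[s] + kd * deriv)
    # Return updated values, the snapshot used for the next derivative term,
    # and the updated integral accumulator.
    return V_new, V.copy(), z
```

With `ki` and `kd` set to zero the update reduces to plain TD(0), which makes the sketch easy to sanity-check; nonzero integral and derivative gains are what would provide acceleration on slowly mixing, high-discount problems.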