Long-horizon tasks, which correspond to a discount factor close to one, pose a challenge for most conventional reinforcement learning (RL) algorithms. Algorithms such as Value Iteration and Temporal Difference (TD) learning converge slowly and become inefficient in these tasks. When the transition distributions are given, PID VI was recently introduced to accelerate the convergence of Value Iteration using ideas from control theory. Inspired by this, we introduce the PID TD Learning and PID Q-Learning algorithms for the RL setting, in which only samples from the environment are available. We give a theoretical analysis of their convergence and of their acceleration compared to their traditional counterparts. We also introduce a method for adapting PID gains in the presence of noise and empirically verify its effectiveness.
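To make the idea concrete, below is a minimal sketch of a tabular PID TD(0) update, assuming the PID VI recursion with the sampled TD error standing in for the Bellman residual. The gain names (kp, ki, kd) and the integrator parameters (alpha, beta) follow common PID conventions and are illustrative assumptions, not necessarily the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact algorithm): a tabular PID TD(0)
# update in which proportional, integral, and derivative terms act on the
# sampled TD error instead of the exact Bellman residual used by PID VI.
import numpy as np

def pid_td_step(V, V_prev, z, s, r, s_next, gamma, lr, kp, ki, kd, alpha, beta):
    """One sampled PID TD(0) update at state s after observing (r, s_next).

    V      : current value estimates (array over states)
    V_prev : value estimates from the previous update (for the derivative term)
    z      : running (integral) accumulation of TD errors
    """
    td_error = r + gamma * V[s_next] - V[s]            # sampled Bellman residual
    z_new = beta * z[s] + alpha * td_error             # integral term update
    V_new = V[s] + lr * (kp * td_error                 # proportional term
                         + ki * z_new                  # integral term
                         + kd * (V[s] - V_prev[s]))    # derivative term
    V_prev[s] = V[s]
    V[s], z[s] = V_new, z_new
    return V, V_prev, z
```

Note that with kp = 1, ki = 0, and kd = 0, the update reduces to ordinary TD(0), V[s] += lr * td_error, so the standard algorithm is recovered as a special case of the gains.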