Many reinforcement learning approaches rely on temporal-difference (TD) learning to learn a critic. However, TD-learning updates can have high variance. Here, we introduce a model-based RL framework, Taylor TD, which reduces this variance in continuous state-action settings. Taylor TD uses a first-order Taylor series expansion of TD updates. This expansion allows Taylor TD to analytically integrate over stochasticity in the action choice, and over some of the stochasticity in the state distribution, for the initial state and action of each TD update. We include theoretical and empirical evidence that Taylor TD updates indeed have lower variance than standard TD updates. Additionally, we show that Taylor TD has the same stable learning guarantees as standard TD-learning with linear function approximation under a reasonable assumption. Next, we combine Taylor TD with the TD3 algorithm, forming TaTD3. We show that TaTD3 performs as well as, if not better than, several state-of-the-art model-free and model-based baseline algorithms on a set of standard benchmark tasks.
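The idea of integrating over action noise analytically can be illustrated with a toy sketch. It is not the Taylor TD algorithm itself. Everything in it is an assumption made for illustration: a scalar action with Gaussian noise, a hypothetical TD error that is linear in the action, and a critic that is linear in its parameters with features `[1, a]`. Under these assumptions, expanding the TD error and the critic gradient to first order in the noise and taking the expectation gives a closed-form update with no sampling noise, whereas the sampled update fluctuates around the same mean.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): a = mu + xi, xi ~ N(0, sigma^2).
C0, C1 = 0.5, -2.0  # assumed TD error: delta(a) = C0 + C1 * a


def td_error(a):
    """Assumed TD error as a function of the action."""
    return C0 + C1 * a


def critic_grad(a):
    """Gradient w.r.t. theta of an assumed linear critic Q(a) = theta . [1, a]."""
    a = np.asarray(a, dtype=float)
    return np.stack([np.ones_like(a), a], axis=-1)


def sampled_update(mu, sigma, rng, n):
    """n independent sampled updates delta(a) * grad_theta Q(a), shape (n, 2)."""
    a = mu + sigma * rng.standard_normal(n)
    return td_error(a)[:, None] * critic_grad(a)


def expanded_update(mu, sigma):
    """Expected update from first-order expansions of both factors around mu.

    E[(delta + delta' xi)(g + g' xi)] = delta * g + sigma^2 * delta' * g',
    because E[xi] = 0 and E[xi^2] = sigma^2.
    """
    d_delta_da = C1                    # derivative of td_error w.r.t. a
    d_grad_da = np.array([0.0, 1.0])   # derivative of critic_grad w.r.t. a
    return td_error(mu) * critic_grad(mu) + sigma**2 * d_delta_da * d_grad_da


rng = np.random.default_rng(0)
samples = sampled_update(mu=0.3, sigma=0.5, rng=rng, n=200_000)
analytic = expanded_update(mu=0.3, sigma=0.5)

print("mean of sampled updates:", samples.mean(axis=0))
print("expanded update:        ", analytic)
print("per-sample variance:    ", samples.var(axis=0))
```

Because both toy functions are linear in the action, the first-order expansion is exact here; for a nonlinear critic or TD error it would only be an approximation, with an error that depends on the noise scale.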