Off-policy deep reinforcement learning algorithms commonly compensate for overestimation bias during temporal-difference learning by using pessimistic estimates of the expected target returns. In this work, we propose Generalized Pessimism Learning (GPL), a strategy that employs a novel learnable penalty to enact such pessimism. In particular, we propose to learn this penalty alongside the critic with dual TD-learning, a new procedure that estimates and minimizes the magnitude of the bias in the target returns at negligible computational cost. GPL enables us to accurately counteract overestimation bias throughout training without incurring the downsides of overly pessimistic targets. By integrating GPL with popular off-policy algorithms, we achieve state-of-the-art results on competitive benchmarks with both proprioceptive and pixel-based observations.
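To make the idea of a learnable pessimism penalty concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that the penalty is the absolute disagreement between two critics scaled by a scalar `beta`, and that `beta` is adjusted with a simple gradient-style step on the average gap between critic predictions and TD targets, used here as a stand-in for the estimated bias. The function names, the form of the penalty, and the learning rate are all hypothetical.

```python
from statistics import fmean


def pessimistic_target(reward, q1_next, q2_next, beta, gamma=0.99):
    """TD target with a penalty scaled by a learnable coefficient beta.

    Assumption: the penalty is the absolute disagreement between two
    target critics; the abstract does not specify its form.
    """
    mean_q = 0.5 * (q1_next + q2_next)
    penalty = abs(q1_next - q2_next)
    return reward + gamma * (mean_q - beta * penalty)


def update_beta(beta, q_pred, targets, lr=0.1):
    """One hypothetical dual-style update of beta.

    The mean gap between critic predictions and targets serves as a
    rough bias estimate: a positive gap (apparent overestimation)
    increases beta, a negative gap decreases it.
    """
    bias = fmean(q - y for q, y in zip(q_pred, targets))
    return beta + lr * bias


# Example: compute one target, then adjust beta from a small batch.
y = pessimistic_target(reward=1.0, q1_next=2.0, q2_next=4.0, beta=0.5, gamma=0.5)
new_beta = update_beta(0.5, q_pred=[3.0, 5.0], targets=[2.0, 4.0])
```

In this sketch a larger `beta` makes the targets more pessimistic, and the update pushes `beta` in whichever direction shrinks the measured gap, which mirrors the stated goal of counteracting overestimation without becoming overly pessimistic.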