By reusing data throughout training, off-policy deep reinforcement learning algorithms offer improved sample efficiency relative to on-policy approaches. For continuous action spaces, the most popular methods for off-policy learning include policy improvement steps where a learned state-action ($Q$) value function is maximized over selected batches of data. These updates are often paired with regularization to combat associated overestimation of $Q$ values. With an eye toward safety, we revisit this strategy in environments with "mixed-sign" reward functions; that is, with reward functions that include independent positive (incentive) and negative (cost) terms. This setting is common in real-world applications, and may be addressed with or without constraints on the cost terms. We find the combination of function approximation and a term that maximizes $Q$ in the policy update to be problematic in such environments, because systematic errors in value estimation impact the contributions from the competing terms asymmetrically. This results in overemphasis of either incentives or costs and may severely limit learning. We explore two remedies to this issue. First, consistent with prior work, we find that periodic resetting of $Q$ and policy networks can be used to reduce value estimation error and improve learning in this setting. Second, we formulate novel off-policy actor-critic methods for both unconstrained and constrained learning that do not explicitly maximize $Q$ in the policy update. We find that this second approach, when applied to continuous action spaces with mixed-sign rewards, consistently and significantly outperforms state-of-the-art methods augmented by resetting. We further find that our approach produces agents that are both competitive with popular methods overall and more reliably competent on frequently studied control problems that do not have mixed-sign rewards.
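The overestimation that motivates regularization can be illustrated with a toy experiment. The sketch below is not the method described above: it assumes a small discrete set of actions whose true value is exactly zero and adds zero-mean Gaussian noise to stand in for function-approximation error. The function name `mean_max_estimate` and all parameter values are hypothetical, chosen only for illustration. It shows that taking a maximum over noisy estimates yields a positive value on average even though every true value is zero.

```python
import random
import statistics


def mean_max_estimate(n_actions, noise_std, n_trials, seed=0):
    """Average of max_a Q_hat(a) when the true Q(a) is 0 for every action."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_trials):
        # Each estimate is the true value (0.0) plus zero-mean Gaussian noise
        # (an assumed stand-in for value estimation error).
        q_hat = [0.0 + rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        maxima.append(max(q_hat))
    return statistics.fmean(maxima)


# With one action there is nothing to maximize over, so the average stays near 0.
bias_single = mean_max_estimate(n_actions=1, noise_std=1.0, n_trials=20000)
# With ten actions the maximum systematically picks out positive noise.
bias_many = mean_max_estimate(n_actions=10, noise_std=1.0, n_trials=20000)

print(f"mean of max over 1 action:   {bias_single:+.3f}")
print(f"mean of max over 10 actions: {bias_many:+.3f}")
```

In this toy setting the upward bias grows with the number of candidates being maximized over. It does not model the mixed-sign asymmetry itself; it only demonstrates the general overestimation effect of maximizing over imperfect value estimates.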