Offline reinforcement learning suffers from value over-estimation caused by the distribution shift between the dataset and the current learned policy, which leads to learning failure in practice. A common remedy is to add a penalty term to the reward or value estimate in the Bellman iterations. To avoid extrapolation on out-of-distribution (OOD) states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns a conservative V-function by directly imposing a penalty on OOD states. Compared to prior work, CSVE allows more effective in-data policy optimization with conservative value guarantees. We further apply CSVE to develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing states around the dataset, and the actor applies advantage-weighted updates extended with state exploration to improve the policy. We evaluate on the classic continuous control tasks of D4RL, showing that our method performs better than conservative Q-function learning methods and is strongly competitive with recent SOTA methods.
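As a rough illustration of the two ingredients named above, the following NumPy sketch shows one plausible shape of a conservative state-value loss and of advantage weighting. It is not the authors' implementation: the abstract does not give the loss, so the penalty form (mean value on nearby states minus mean value on dataset states), the Gaussian perturbation used to obtain states around the dataset, the exponential weight with clipping, and all names (`sample_nearby_states`, `conservative_v_loss`, `advantage_weights`, `alpha`, `beta`, `sigma`, `max_weight`) are assumptions made for illustration.

```python
import numpy as np


def sample_nearby_states(states, sigma, rng):
    """Hypothetical stand-in for sampling states 'around' the dataset:
    add isotropic Gaussian noise to each dataset state (an assumption)."""
    return states + sigma * rng.standard_normal(states.shape)


def conservative_v_loss(v_fn, states, rewards, next_states, nearby_states,
                        gamma=0.99, alpha=1.0):
    """Squared Bellman error plus an assumed conservative penalty.

    The penalty pushes V down on the nearby (possibly OOD) states and up
    on dataset states, scaled by the hypothetical coefficient alpha.
    """
    td_target = rewards + gamma * v_fn(next_states)
    bellman_error = 0.5 * np.mean((v_fn(states) - td_target) ** 2)
    penalty = np.mean(v_fn(nearby_states)) - np.mean(v_fn(states))
    return bellman_error + alpha * penalty


def advantage_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponential advantage weights, clipped for numerical stability.

    beta (temperature) and max_weight are illustrative choices.
    """
    return np.minimum(np.exp(advantages / beta), max_weight)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v_fn = lambda s: s.sum(axis=1)  # toy value function, for demonstration only
    states = rng.standard_normal((4, 2))
    next_states = rng.standard_normal((4, 2))
    rewards = rng.standard_normal(4)
    nearby = sample_nearby_states(states, sigma=0.1, rng=rng)
    print(conservative_v_loss(v_fn, states, rewards, next_states, nearby))
    print(advantage_weights(np.array([-1.0, 0.0, 1.0])))
```

In an actual actor-critic loop, the loss would be minimized over the parameters of a learned value network, and the weights would scale the log-likelihood of dataset actions in the actor update; both steps are omitted here.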