We consider the problem of learning the best possible policy from a fixed dataset, known as offline Reinforcement Learning (RL). A common category of existing offline RL methods is policy regularization, which typically constrains the learned policy to the distribution or support of the behavior policy. However, distribution and support constraints are overly conservative, since both force the policy to choose actions similar to those of the behavior policy in each particular state. This limits the learned policy's performance, especially when the behavior policy is sub-optimal. In this paper, we find that regularizing the policy towards the nearest state-action pair can be more effective, and thus propose Policy Regularization with Dataset Constraint (PRDC). When updating the policy in a given state, PRDC searches the entire dataset for the nearest state-action sample and then restricts the policy with the action of this sample. Unlike previous works, PRDC can guide the policy with proper behaviors from the dataset, allowing it to choose actions that do not appear in the dataset together with the given state. This is a softer constraint, but it still retains enough conservatism against out-of-distribution actions. Empirical evidence and theoretical analysis show that PRDC can alleviate the value overestimation issue, a fundamental challenge in offline RL, with a bounded performance gap. Moreover, on a set of locomotion and navigation tasks, PRDC achieves state-of-the-art performance compared with existing methods. Code is available at https://github.com/LAMDA-RL/PRDC
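The nearest-sample search described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the brute-force search, the Euclidean distance over concatenated state-action vectors, and the state weight `beta` are all assumptions made here for illustration; the abstract itself only says that the nearest state-action sample is found and its action is used to restrict the policy.

```python
import numpy as np


def nearest_dataset_action(state, policy_action, data_states, data_actions, beta=1.0):
    """Find the dataset sample nearest to (state, policy_action).

    Assumption: distance is Euclidean over the concatenation
    [beta * state, action]; `beta` is a hypothetical weight on the state part.
    Returns the nearest sample's action and its index in the dataset.
    """
    query = np.concatenate([beta * state, policy_action])
    keys = np.concatenate([beta * data_states, data_actions], axis=1)
    idx = int(np.argmin(np.linalg.norm(keys - query, axis=1)))
    return data_actions[idx], idx


def dataset_constraint_penalty(state, policy_action, data_states, data_actions, beta=1.0):
    """Squared distance between the policy's action and the nearest sample's action.

    Assumption: this stands in for the regularization term added to the
    policy objective; the exact form used in the paper may differ.
    """
    target, _ = nearest_dataset_action(
        state, policy_action, data_states, data_actions, beta
    )
    return float(np.sum((policy_action - target) ** 2))


# Toy dataset: two samples with different states and opposite actions.
data_states = np.array([[0.0, 0.0], [1.0, 0.0]])
data_actions = np.array([[-1.0], [1.0]])

state = np.array([0.1, 0.0])      # closest in state space to sample 0
policy_action = np.array([0.9])   # closest in action space to sample 1

for beta in (1.0, 10.0):
    _, idx = nearest_dataset_action(
        state, policy_action, data_states, data_actions, beta
    )
    penalty = dataset_constraint_penalty(
        state, policy_action, data_states, data_actions, beta
    )
    print(f"beta={beta}: nearest sample index={idx}, penalty={penalty:.2f}")
```

In this toy setup, a small `beta` lets the search match sample 1, whose action was recorded in a different state, so the policy is barely penalized for an action the dataset never paired with the current state. A large `beta` makes the state dominate the distance, and the penalty then pulls the policy towards the action recorded in the most similar state, which resembles a stricter behavior-cloning constraint. For a dataset of realistic size, a spatial index such as a KD-tree would likely replace the brute-force search.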