In offline reinforcement learning, a policy is learned from a static dataset, without costly feedback from the environment. Compared to the online setting, relying solely on a static dataset poses additional challenges, such as policies generating out-of-distribution samples. Model-based offline reinforcement learning methods try to overcome these by learning a model of the underlying dynamics of the environment and using it to guide policy search. This is beneficial, but with limited datasets, errors in the model and value overestimation for out-of-distribution states can degrade performance. Current model-based methods apply some notion of conservatism to the Bellman update, often implemented via uncertainty estimation derived from model ensembles. In this paper, we propose Constrained Latent Action Policies (C-LAP), which learns a generative model of the joint distribution of observations and actions. We cast policy learning as a constrained objective that always stays within the support of the latent action distribution, and we use the generative capabilities of the model to impose an implicit constraint on the generated actions. This eliminates the need for additional uncertainty penalties on the Bellman update and significantly decreases the number of gradient steps required to learn a policy. We empirically evaluate C-LAP on the D4RL and V-D4RL benchmarks and show that it is competitive with state-of-the-art methods, particularly outperforming them on datasets with visual observations.
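The core idea of constraining the policy to the support of a latent action distribution can be illustrated with a minimal sketch. The names (`decode_action`, `policy_latent`) and the fixed linear decoder are hypothetical stand-ins for the learned generative model and policy network described in the abstract, not the actual C-LAP implementation; the point is only that squashing the policy output into the latent prior's support means every action is a decodable, in-distribution sample, with no explicit uncertainty penalty needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained decoder of the generative model: maps a latent
# action z to an environment action a. A fixed random linear map followed
# by tanh stands in for the learned decoder here.
W = rng.normal(size=(4, 2))  # latent dim 2 -> action dim 4

def decode_action(z):
    # During training the decoder only sees latents from the prior's
    # support, so any z inside that support yields an in-distribution action.
    return np.tanh(W @ z)

def policy_latent(s):
    # The policy outputs an unconstrained vector that is squashed into the
    # support of the latent prior (here, the unit box via tanh). This is the
    # implicit constraint: the policy can only select decodable latents.
    raw = np.array([1.7 * s.sum(), -0.3])  # stand-in for a policy network
    return np.tanh(raw)

s = np.ones(3)          # a dummy observation
z = policy_latent(s)    # latent action, guaranteed inside the support
a = decode_action(z)    # generated action, in-distribution by construction
```

Because the constraint is built into the action-generation path itself, the Bellman update operates only on actions the generative model could have produced, which is the mechanism the abstract credits for removing ensemble-based uncertainty penalties.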