Offline reinforcement learning (RL) is a compelling framework for learning optimal policies from past experience without additional interaction with the environment. Nevertheless, offline RL inevitably faces the problem of distributional shift, where the states and actions encountered during policy execution may fall outside the distribution of the training dataset. A common solution is to incorporate conservatism into the policy or the value function to guard against uncertainties and unknowns. In this work, we pursue the same objective of conservatism from a different perspective. We propose COmpositional COnservatism with Anchor-seeking (COCOA) for offline RL, an approach that pursues conservatism in a compositional manner on top of the transductive reparameterization (Netanyahu et al., 2023), which decomposes the input variable (the state in our case) into an anchor and its difference from the original input. COCOA seeks both in-distribution anchors and in-distribution differences by utilizing a learned reverse dynamics model, encouraging conservatism in the compositional input space of the policy or value function. Such compositional conservatism is independent of and agnostic to the prevalent behavioral conservatism in offline RL. We apply COCOA to four state-of-the-art offline RL algorithms and evaluate them on the D4RL benchmark, where COCOA generally improves the performance of each algorithm. The code is available at https://github.com/runamu/compositional-conservatism.
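To make the anchor-and-difference decomposition concrete, the following is a minimal sketch and not the released COCOA implementation. It assumes states are plain lists of floats and, as a stand-in for anchor seeking with a learned reverse dynamics model, simply picks the nearest dataset state as the anchor. The names `nearest_anchor`, `decompose`, and `compositional_input` are hypothetical and chosen only for illustration.

```python
import math


def nearest_anchor(state, dataset_states):
    """Return the dataset state closest to `state` in Euclidean distance.

    Assumption: nearest-neighbour lookup replaces the learned
    anchor-seeking procedure described in the abstract.
    """
    return min(dataset_states, key=lambda s: math.dist(s, state))


def decompose(state, dataset_states):
    """Split `state` into an in-dataset anchor and the difference from it."""
    anchor = nearest_anchor(state, dataset_states)
    delta = [x - a for x, a in zip(state, anchor)]
    return anchor, delta


def compositional_input(state, dataset_states):
    """Build the (anchor, delta) input a policy or value function would see."""
    anchor, delta = decompose(state, dataset_states)
    return list(anchor) + delta


# Toy dataset of two 2-D states and a query state that lies near the second one.
dataset_states = [[0.0, 0.0], [1.0, 1.0]]
query = [1.2, 0.9]
anchor, delta = decompose(query, dataset_states)
```

In this sketch the anchor plus the difference always reconstructs the original state, so no information is lost; the intent described in the abstract is that both components, taken separately, stay close to what the function saw during training.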