Offline reinforcement learning (RL) methods constrain the learned policy to stay close to the behavior policy, which stabilizes value learning and reduces the selection of out-of-distribution (OOD) actions at test time. Conventional approaches apply the same constraint to both value learning and test-time inference. However, our findings indicate that a constraint suitable for value estimation may be excessively restrictive for action selection at test time. To address this issue, we propose a Mildly Constrained Evaluation Policy (MCEP) for test-time inference, paired with a more constrained target policy for value estimation. Since the target policy has been adopted in various prior approaches, MCEP can be integrated with them as a plug-in. We instantiate MCEP on top of the TD3-BC [Fujimoto and Gu, 2021] and AWAC [Nair et al., 2020] algorithms. Empirical results on MuJoCo locomotion tasks show that MCEP significantly outperforms the target policy and achieves results competitive with state-of-the-art offline RL methods. The code is open-sourced at https://github.com/egg-west/MCEP.git.
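
As a rough illustration of the two-policy idea (a sketch under stated assumptions, not the authors' implementation), consider a TD3-BC-style objective that minimizes -w * Q(s, a) + (a - a_behavior)^2 on a toy one-dimensional problem. The quadratic critic, the weights 0.5 and 4.0, and every name below are hypothetical and chosen only to show how a smaller weight on Q keeps the target policy near the data, while a larger weight lets the evaluation policy move further toward the action the critic prefers.

```python
def constrained_action(q_weight, a_behavior, a_greedy):
    """Minimizer of -q_weight * Q(a) + (a - a_behavior)**2 for the toy
    critic Q(a) = -(a - a_greedy)**2 (an assumed stand-in for a learned critic)."""
    # Setting the derivative to zero:
    #   2 * q_weight * (a - a_greedy) + 2 * (a - a_behavior) = 0
    return (q_weight * a_greedy + a_behavior) / (q_weight + 1.0)


a_behavior = 0.0  # action observed in the dataset (hypothetical value)
a_greedy = 1.0    # action the toy critic prefers (hypothetical value)

# Strongly constrained target policy: would be used only for value estimation.
target_action = constrained_action(0.5, a_behavior, a_greedy)

# Mildly constrained evaluation policy: would be used only at test time.
eval_action = constrained_action(4.0, a_behavior, a_greedy)

print(target_action, eval_action)
```

In this toy setting both actions lie between the dataset action and the critic's preferred action, with the evaluation policy's action closer to the latter. In a full method the critic would be learned from data using the target policy, and the evaluation policy would be trained against that critic without feeding back into value learning.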