Offline reinforcement learning learns from a static dataset without interacting with the environment, which guarantees safety and therefore has promising applications. However, naively applying standard reinforcement learning algorithms in the offline setting usually fails, because out-of-distribution (OOD) state-actions cause inaccurate Q-value approximation. Penalizing the Q-values of OOD state-actions is an effective way to address this problem. Among such penalization methods, count-based methods have achieved good results in discrete domains with a simple form. Inspired by this, a novel pseudo-count method for continuous domains, the Grid-Mapping Pseudo-Count method (GPC), is proposed by extending the count-based method from discrete to continuous domains. First, the continuous state and action spaces are mapped to discrete spaces using grid mapping, and the Q-values of OOD state-actions are then constrained through pseudo-counts. Second, a theoretical proof shows that GPC obtains appropriate uncertainty constraints under fewer assumptions than other pseudo-count methods. Third, GPC is combined with the Soft Actor-Critic algorithm (SAC) to obtain a new algorithm, GPC-SAC. Finally, experiments on D4RL datasets show that GPC-SAC achieves better performance and lower computational cost than other algorithms that constrain the Q-value.
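The grid-mapping idea described above can be sketched as follows. This is a minimal illustration only: the class name, the bin layout, and the `scale / sqrt(n + 1)` penalty form are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

class GridPseudoCount:
    """Sketch of a grid-mapping pseudo-count for continuous state-actions.

    Continuous vectors are mapped to discrete grid cells; visit counts per
    cell yield an uncertainty penalty that is large for rarely visited
    (possibly OOD) regions. All names and the penalty form are illustrative.
    """

    def __init__(self, low, high, bins_per_dim=10):
        self.low = np.asarray(low, dtype=float)
        self.high = np.asarray(high, dtype=float)
        self.bins = bins_per_dim
        self.counts = {}  # grid cell (tuple of indices) -> visit count

    def _cell(self, x):
        # Map a continuous state-action vector to a discrete grid cell.
        frac = (np.asarray(x, dtype=float) - self.low) / (self.high - self.low)
        idx = np.clip((frac * self.bins).astype(int), 0, self.bins - 1)
        return tuple(idx)

    def update(self, x):
        # Count one visit to the cell containing x (e.g. for each
        # state-action pair in the offline dataset).
        c = self._cell(x)
        self.counts[c] = self.counts.get(c, 0) + 1

    def penalty(self, x, scale=1.0):
        # Uncertainty penalty: shrinks as the cell's count grows, so
        # well-covered state-actions are penalized less than OOD ones.
        n = self.counts.get(self._cell(x), 0)
        return scale / np.sqrt(n + 1)
```

In use, one would count every state-action pair in the static dataset once, then subtract `penalty(s, a)` from the Q-value target during critic training, so that OOD state-actions receive pessimistic value estimates.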