In reinforcement learning, offline value function learning is the procedure of using an offline dataset to estimate the expected discounted return from each state when actions are taken according to a fixed target policy. The stability of this procedure, i.e., whether it converges to its fixed point, critically depends on the representations of the state-action pairs. Poorly learned representations can make value function learning unstable or even divergent. It is therefore critical to stabilize value function learning by explicitly shaping the state-action representations. Recently, the class of bisimulation-based algorithms has shown promise in shaping representations for control. However, it remains unclear whether this class of methods can stabilize value function learning. In this work, we investigate this question and answer it affirmatively. We introduce a bisimulation-based algorithm called kernel representations for offline policy evaluation (KROPE). KROPE uses a kernel to shape state-action representations such that state-action pairs that have similar immediate rewards and lead to similar next state-action pairs under the target policy also have similar representations. We show that KROPE 1) learns stable representations and 2) achieves lower value error than baselines. Our analysis provides new theoretical insight into the stability properties of bisimulation-based methods and suggests that practitioners can use them for stable and accurate evaluation of offline reinforcement learning agents.
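To make the core idea concrete, below is a minimal sketch of a bisimulation-style kernel-matching objective of the kind KROPE describes: the inner-product similarity between two state-action representations is regressed toward a reward-similarity term plus the discounted similarity of their successor state-action pairs under the target policy. The encoder architecture, the product-form reward kernel, and the names (`Encoder`, `krope_style_loss`) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Hypothetical encoder mapping a state-action pair to a representation."""
    def __init__(self, obs_dim, act_dim, rep_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, rep_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def krope_style_loss(encoder, target_encoder, s, a, r, s_next, a_next, gamma=0.99):
    """Sketch of a bisimulation-style kernel-matching loss.

    The learned similarity k(x, y) = <phi(x), phi(y)> between pairs in the
    batch is pushed toward a reward-similarity term plus the discounted
    similarity of successors; a_next is assumed to be sampled from the
    target policy. The product reward kernel below is one illustrative
    choice, not necessarily the one used by KROPE.
    """
    phi = encoder(s, a)                               # (B, d) current representations
    with torch.no_grad():
        phi_next = target_encoder(s_next, a_next)     # (B, d) frozen successor reps
    k = phi @ phi.T                                   # (B, B) current pairwise similarities
    reward_sim = r.unsqueeze(1) * r.unsqueeze(0)      # (B, B) illustrative reward kernel
    k_next = phi_next @ phi_next.T                    # (B, B) successor similarities
    target = reward_sim + gamma * k_next              # bisimulation-style fixed-point target
    return ((k - target) ** 2).mean()
```

Using a frozen target encoder for the successor similarities mirrors standard target-network practice in value learning; the representation is then consumed by a downstream offline policy evaluation method.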