Because they cannot interact with the environment, offline reinforcement learning (RL) methods face the challenge of estimating values at out-of-distribution (OOD) points. Existing methods address this issue either by constraining the policy to exclude OOD actions or by making the $Q$-function pessimistic. However, these methods can be overly conservative or can fail to identify OOD regions accurately. To overcome this problem, we propose Constrained Policy optimization with Explicit Behavior density (CPED), a method that uses a flow-GAN model to estimate the density of the behavior policy explicitly. With an explicit density estimate, CPED can accurately identify the safe region and optimize within it, which yields less conservative learned policies. We further provide theoretical results for the flow-GAN estimator and a performance guarantee for CPED, showing that CPED can find the optimal $Q$-function value. Empirically, CPED outperforms existing alternatives on various standard offline RL tasks, yielding higher expected returns.
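The idea of optimizing only inside a density-defined safe region can be illustrated with a toy sketch. This is not the CPED algorithm: the standard normal density below is a hypothetical stand-in for a learned flow-GAN density, the linear `q_value` is an assumed Q estimate that is overestimated away from the data, and the function names and threshold are chosen purely for illustration.

```python
import numpy as np

def behavior_log_density(action):
    """Hypothetical stand-in for a learned explicit behavior density log p(a | s)."""
    # Standard normal log-density; a real estimator would be fitted to the offline data.
    return -0.5 * action**2 - 0.5 * np.log(2.0 * np.pi)

def q_value(action):
    """Assumed Q estimate that keeps growing away from the data (OOD overestimation)."""
    return action

def constrained_greedy_action(candidates, log_density_threshold):
    """Return the highest-Q candidate whose behavior log-density clears the threshold."""
    safe = behavior_log_density(candidates) >= log_density_threshold
    if not np.any(safe):
        raise ValueError("no candidate action lies inside the safe region")
    safe_candidates = candidates[safe]
    return float(safe_candidates[np.argmax(q_value(safe_candidates))])

# Candidate actions on a grid; the safe region is roughly |a| <= 2.05.
candidates = np.linspace(-5.0, 5.0, 101)
threshold = behavior_log_density(2.05)

# Unconstrained maximization chases the overestimated Q into the OOD region.
unconstrained = float(candidates[np.argmax(q_value(candidates))])
# The density constraint keeps the chosen action where the data has support.
constrained = constrained_greedy_action(candidates, threshold)
```

In this toy setting the unconstrained choice lands at the edge of the grid, far from the data, while the constrained choice stays at the boundary of the safe region. A hard pessimistic penalty on every unseen action would be more conservative than this threshold, which is the trade-off the abstract describes.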