Constrained policy search (CPS) is a fundamental problem in offline reinforcement learning and is generally solved by advantage weighted regression (AWR). However, existing methods may still encounter out-of-distribution actions because Gaussian-based policies have limited expressivity. On the other hand, directly applying state-of-the-art expressive generative models (i.e., diffusion models) within the AWR framework is infeasible, since AWR requires exact policy probability densities, which are intractable to compute for diffusion models. In this paper, we propose a novel approach, $\textbf{Diffusion-based Constrained Policy Search}$ (dubbed DiffCPS), which tackles diffusion-based constrained policy search with the primal-dual method. Our theoretical analysis shows that strong duality holds for diffusion-based CPS problems and that, once parameter approximation is introduced, an approximate solution can be obtained after $\mathcal{O}(1/\epsilon)$ dual iterations, where $\epsilon$ denotes the representation ability of the parametrized policy. Extensive experiments on the D4RL benchmark demonstrate the efficacy of our approach: DiffCPS achieves better or at least competitive performance compared with traditional AWR-based baselines as well as recent diffusion-based offline RL methods. The code is available at https://github.com/felix-thu/DiffCPS.
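To make the primal-dual idea mentioned in the abstract more concrete, the following is a minimal toy sketch, not the DiffCPS algorithm itself. It assumes a single state, a tabular softmax policy in place of a diffusion policy, and a constraint of the form $D_{\mathrm{KL}}(\pi_b \,\|\, \pi) \le \kappa$; the constraint form, the function name `primal_dual`, the step sizes, and all numbers are illustrative assumptions rather than details taken from the paper. The loop alternates a gradient ascent step on the policy parameters with a projected update of the Lagrange multiplier.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def primal_dual(q_values, behavior, kappa, steps=5000, lr_theta=0.1, lr_lam=0.05):
    """Toy primal-dual loop (hypothetical helper, not the paper's method).

    Maximizes E_pi[Q] subject to KL(behavior || pi) <= kappa using the
    Lagrangian L(theta, lam) = E_pi[Q] - lam * (KL(behavior || pi) - kappa).
    """
    theta = [math.log(b) for b in behavior]  # start at the behavior policy
    lam = 1.0  # initial Lagrange multiplier (assumed value)
    for _ in range(steps):
        pi = softmax(theta)
        expected_q = sum(p * q for p, q in zip(pi, q_values))
        # Primal step: gradient ascent on L with respect to the logits.
        # d E_pi[Q] / d theta_j      = pi_j * (Q_j - E_pi[Q])
        # d KL(b || pi) / d theta_j  = pi_j - b_j
        grad = [
            p * (q - expected_q) - lam * (p - b)
            for p, q, b in zip(pi, q_values, behavior)
        ]
        theta = [t + lr_theta * g for t, g in zip(theta, grad)]
        # Dual step: raise lam when the constraint is violated,
        # lower it otherwise, and project back onto lam >= 0.
        violation = kl(behavior, softmax(theta)) - kappa
        lam = max(0.0, lam + lr_lam * violation)
    return softmax(theta), lam


if __name__ == "__main__":
    policy, multiplier = primal_dual(
        q_values=[1.0, 0.0, 0.5], behavior=[0.2, 0.5, 0.3], kappa=0.1
    )
    print("policy:", policy)
    print("multiplier:", multiplier)
```

In this toy setting the multiplier acts as an adaptive penalty weight: it grows while the learned policy drifts too far from the behavior policy and shrinks once the constraint is satisfied, so the trade-off between return and closeness to the data does not have to be tuned by hand.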