Offline RL algorithms aim to improve upon the behavior policy that produced the collected data while constraining the learned policy to stay within the support of the dataset. However, practical offline datasets often contain examples with little diversity or limited exploration of the environment, and are often collected by multiple behavior policies with varying levels of expertise. Limited exploration impairs an offline RL algorithm's ability to estimate \textit{Q} or \textit{V} values, while constraining toward diverse behavior policies can be overly conservative. Such datasets call for a balance between the RL objective and the behavior-policy constraint. We first establish the connection between the $f$-divergence and an optimization constraint on the Bellman residual through a more general linear programming (LP) form of RL and the convex conjugate. Building on this, we introduce a flexible function formulation of the $f$-divergence that adapts the constraint on the algorithm's learning objective to the offline training dataset. Experiments on the MuJoCo, Fetch, and AdroitHand environments confirm the correctness of the proposed LP form and demonstrate the potential of the flexible $f$-divergence to improve performance when learning from challenging datasets with a compatible constrained-optimization algorithm.
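To make the claimed connection concrete, the following is a minimal sketch of one standard regularized-LP duality of this kind (notation assumed for illustration and not necessarily the exact form used here: $d$ the state-action occupancy measure, $d^{\mathcal{D}}$ the dataset occupancy, $\mu_0$ the initial-state distribution, $\alpha > 0$ a constraint weight, and $f^*$ the convex conjugate of $f$). The $f$-divergence-regularized LP form of RL reads
\begin{align*}
\max_{d \ge 0} \;& \mathbb{E}_{(s,a)\sim d}\big[r(s,a)\big] \;-\; \alpha\, D_f\big(d \,\|\, d^{\mathcal{D}}\big) \\
\text{s.t.} \;& \textstyle\sum_a d(s,a) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s,
\end{align*}
and introducing Lagrange multipliers $V(s)$ for the flow constraints, then eliminating $d$ via the conjugate $f^*$, yields the unconstrained dual
\begin{equation*}
\min_{V} \;(1-\gamma)\, \mathbb{E}_{s\sim\mu_0}\big[V(s)\big] \;+\; \alpha\, \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\!\left[ f^*\!\left( \tfrac{1}{\alpha}\Big(r(s,a) + \gamma\, \mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V(s')\big] - V(s)\Big) \right) \right],
\end{equation*}
in which the argument of $f^*$ is exactly the Bellman residual, so the choice of $f$ directly controls how the residual is constrained.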