In offline reinforcement learning (RL), a major obstacle to policy learning is the accumulation of errors in the deep Q function over out-of-distribution (OOD) regions. Existing offline RL methods are often over-conservative, which hurts generalization outside the data distribution. In our study, we observe that deep Q functions approximate well inside the convex hull of the training data. Inspired by this, we propose DOGE (Distance-sensitive Offline RL with better GEneralization), a method that marries dataset geometry with deep function approximators in offline RL and enables exploitation in generalizable OOD areas rather than strictly constraining the policy to the data distribution. Specifically, DOGE trains a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint. Simple yet elegant, our algorithm generalizes better than state-of-the-art methods on D4RL benchmarks. Theoretical analysis demonstrates the superiority of our approach over existing methods that rely solely on data distribution or support constraints.
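To make the idea of a state-conditioned distance function used as a policy constraint more concrete, the following is a minimal sketch, not the authors' implementation. It assumes (none of this is stated in the abstract) that the distance function is fit by regressing onto the Euclidean distance between a dataset action and a randomly sampled action, and that the constraint enters the actor objective as a Lagrangian penalty. The names `distance_regression_targets`, `constrained_actor_loss`, `lagrange_multiplier`, and `threshold` are hypothetical.

```python
import numpy as np


def distance_regression_targets(dataset_actions, sampled_actions):
    """Regression targets for a distance function g(s, a_hat).

    Assumption: the target for a sampled action a_hat at state s is the
    Euclidean distance to the dataset action a recorded at that state.
    Both arrays have shape (batch, action_dim).
    """
    return np.linalg.norm(dataset_actions - sampled_actions, axis=-1)


def constrained_actor_loss(q_values, distances, lagrange_multiplier, threshold):
    """Lagrangian-style actor loss (hypothetical form).

    Minimizing this loss maximizes Q while penalizing policy actions whose
    predicted distance to the dataset exceeds `threshold`.
    q_values and distances have shape (batch,).
    """
    penalty = lagrange_multiplier * (distances - threshold)
    return float(np.mean(-q_values + penalty))


# Toy usage with made-up numbers.
dataset_actions = np.array([[0.0, 0.0], [1.0, 1.0]])
sampled_actions = np.array([[3.0, 4.0], [1.0, 1.0]])
targets = distance_regression_targets(dataset_actions, sampled_actions)

q_values = np.array([1.0, 3.0])    # critic estimates for the policy's actions
distances = np.array([0.5, 1.5])   # distance-function outputs for those actions
loss = constrained_actor_loss(q_values, distances, lagrange_multiplier=2.0, threshold=1.0)
print(targets, loss)
```

In this sketch, an action whose predicted distance stays below `threshold` lowers the loss, so the policy is free to move into nearby OOD regions, while actions far from the data are penalized in proportion to `lagrange_multiplier`.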