In offline reinforcement learning (RL), a major obstacle to policy learning is the accumulation of errors in the deep Q function in out-of-distribution (OOD) areas. Unfortunately, existing offline RL methods are often over-conservative, which inevitably hurts generalization outside the data distribution. In our study, one interesting observation is that deep Q functions approximate well inside the convex hull of the training data. Inspired by this, we propose a new method, DOGE (Distance-sensitive Offline RL with better GEneralization). DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining the policy within the data distribution. Specifically, DOGE trains a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint. Simple yet elegant, our algorithm enjoys better generalization compared to state-of-the-art methods on D4RL benchmarks. Theoretical analysis demonstrates the superiority of our approach over existing methods that are solely based on data distribution or support constraints.
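To make the idea of a distance function used as a policy constraint more concrete, the following is a minimal sketch and not the DOGE implementation. It assumes a hand-written, kernel-weighted stand-in for the learned state-conditioned distance function and a Lagrangian-style penalty in the actor loss. The names `dataset_distance`, `constrained_actor_loss`, `bandwidth`, `threshold`, and `multiplier` are hypothetical and chosen only for illustration.

```python
import math


def dataset_distance(state, action, dataset, bandwidth=1.0):
    """Hypothetical stand-in for a learned state-conditioned distance g(s, a).

    Returns the mean Euclidean distance from `action` to the dataset actions,
    weighted by how similar each dataset state is to `state` (Gaussian kernel).
    """
    weights, dists = [], []
    for s_i, a_i in dataset:
        w = math.exp(-math.dist(state, s_i) ** 2 / (2.0 * bandwidth ** 2))
        weights.append(w)
        dists.append(math.dist(action, a_i))
    total = sum(weights)
    if total == 0.0:
        # All kernel weights underflowed: fall back to an unweighted mean.
        return sum(dists) / len(dists)
    return sum(w * d for w, d in zip(weights, dists)) / total


def constrained_actor_loss(q_value, distance, threshold, multiplier):
    """Assumed Lagrangian-style actor loss (to be minimized).

    Maximizes Q(s, a) while penalizing actions whose distance to the data
    exceeds `threshold`; `multiplier` plays the role of a Lagrange multiplier.
    """
    return -q_value + multiplier * (distance - threshold)


if __name__ == "__main__":
    # Toy dataset of (state, action) pairs; both samples share the same state.
    data = [((0.0, 0.0), (0.0,)), ((0.0, 0.0), (1.0,))]
    for candidate in [(0.5,), (3.0,)]:
        g = dataset_distance((0.0, 0.0), candidate, data)
        loss = constrained_actor_loss(1.0, g, threshold=1.0, multiplier=2.0)
        print(candidate, g, loss)
```

With equal Q-values, the action lying between the two dataset actions receives a lower loss than the action far outside them, so the penalty steers the policy toward actions close to the data without forbidding nearby OOD actions outright.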