Offline Goal-Conditioned Reinforcement Learning (GCRL) holds great promise for domains such as autonomous navigation and locomotion, where collecting interactive data is costly and unsafe. However, it remains challenging in practice because the agent must learn from datasets with limited coverage of the state-action space and generalize across long-horizon tasks. To address these challenges, we propose a \emph{Physics-informed (Pi)} regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE), which induces a geometric inductive bias in the learned value function. Unlike generic gradient penalties that are primarily used to stabilize training, our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. The proposed regularizer is broadly compatible with temporal-difference-based value learning and can be integrated into existing Offline GCRL algorithms. When combined with Hierarchical Implicit Q-Learning (HIQL), the resulting method, Eikonal-regularized HIQL (Eik-HIQL), yields significant improvements in both performance and generalization, with especially pronounced gains in stitching regimes and large-scale navigation tasks.
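For intuition, and as a sketch rather than the paper's exact formulation, the Eikonal PDE states that the gradient norm of the minimum travel time $T(s,g)$ to a goal $g$ equals the reciprocal of the local speed $c(s)$; one plausible way to turn this into a value-learning regularizer, with the weight $\lambda$ and the unit-speed simplification $c(s)\equiv 1$ introduced here purely for illustration, is
\[
\big\lVert \nabla_s T(s,g) \big\rVert \, c(s) = 1,
\qquad
\mathcal{L}_{\mathrm{Eik}}(\theta)
= \mathbb{E}_{s,g}\!\Big[\big(\lVert \nabla_s V_\theta(s,g) \rVert - 1\big)^2\Big],
\qquad
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{TD}}(\theta) + \lambda\,\mathcal{L}_{\mathrm{Eik}}(\theta),
\]
where $\mathcal{L}_{\mathrm{TD}}$ denotes the temporal-difference value loss of the base algorithm (e.g., HIQL) and the Eikonal term pushes the learned value $V_\theta$ toward the gradient structure of a cost-to-go function.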