Offline Reinforcement Learning (RL) faces the fundamental challenge of extrapolation error caused by out-of-distribution (OOD) actions. Implicit Q-Learning (IQL) addresses this by employing expectile regression to achieve in-sample learning. However, IQL relies on a fixed expectile hyperparameter and a density-based policy improvement method, both of which limit its adaptability and performance. In this paper, we propose Projective IQL (PIQL), a projective variant of IQL enhanced with a support constraint. In the policy evaluation stage, PIQL replaces the fixed expectile hyperparameter with a projection-based parameter and extends the one-step value estimation to a multi-step formulation. In the policy improvement stage, PIQL adopts a support constraint in place of a density constraint, ensuring closer alignment with the policy evaluation stage. Theoretically, we show that PIQL preserves the expectile regression and in-sample learning framework, guarantees monotonic policy improvement, and introduces a progressively more rigorous criterion for advantageous actions. Experiments on the D4RL and NeoRL2 benchmarks demonstrate that PIQL yields robust gains across diverse domains and achieves state-of-the-art performance overall.
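For context, the expectile-regression objective referenced above is the standard IQL value-learning loss from the literature; the sketch below recalls it under that assumption (the symbols $V_\psi$, $Q_{\hat\theta}$, expectile $\tau \in (0,1)$, and dataset $\mathcal{D}$ follow common IQL notation and are used here only for illustration). PIQL, as described in this paper, replaces the fixed $\tau$ with a projection-based parameter rather than altering the form of the loss.
\[
L_V(\psi) \;=\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\, L_2^{\tau}\!\big(Q_{\hat\theta}(s,a) - V_\psi(s)\big) \,\right],
\qquad
L_2^{\tau}(u) \;=\; \big|\tau - \mathbb{1}(u < 0)\big|\, u^2 .
\]
Because the loss depends only on state-action pairs drawn from $\mathcal{D}$, no OOD actions are ever queried, which is what makes the learning in-sample.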