Offline Reinforcement Learning (RL) relies on policy constraints to mitigate extrapolation error, and both the constraint form and the constraint strength critically shape performance. However, most existing methods commit to a single constraint family (weighted behavior cloning, density regularization, or support constraints) without a unified principle that explains their connections or trade-offs. In this work, we propose Continuous Constraint Interpolation (CCI), a unified optimization framework in which these three constraint families arise as special cases along a common constraint spectrum. CCI introduces a single interpolation parameter that enables smooth transitions and principled combinations across constraint types. Building on CCI, we develop Automatic Constraint Policy Optimization (ACPO), a practical primal--dual algorithm that adapts the interpolation parameter via a Lagrangian dual update. Moreover, we establish a maximum-entropy performance difference lemma and derive performance lower bounds for both the closed-form optimal policy and its parametric projection. Experiments on D4RL and NeoRL2 demonstrate robust gains across diverse domains, achieving state-of-the-art performance overall.
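As a concrete illustration of the primal--dual mechanism mentioned above, the following is a minimal, hypothetical sketch of an alternating primal--dual scheme; the symbols $\widehat{J}$ (estimated return), $\widehat{D}$ (estimated constraint value), $\epsilon$ (constraint budget), $\eta_\lambda$ (dual step size), and $\pi_\beta$ (behavior policy) are illustrative assumptions, not the paper's exact formulation:
\[
\theta_{k+1} \in \arg\max_{\theta}\; \widehat{J}(\pi_\theta) - \lambda_k\, \widehat{D}(\pi_\theta,\pi_\beta),
\qquad
\lambda_{k+1} = \max\!\bigl(0,\; \lambda_k + \eta_\lambda\bigl(\widehat{D}(\pi_{\theta_{k+1}},\pi_\beta) - \epsilon\bigr)\bigr),
\]
where the dual variable $\lambda$ acts as an adaptive constraint strength: it increases when the estimated constraint value exceeds the budget and decays toward zero when there is slack.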