Most offline reinforcement learning (RL) methods face a trade-off between improving the policy beyond the behavior policy and constraining it to limit deviation from the behavior policy, because $Q$-values computed with out-of-distribution (OOD) actions suffer from errors caused by distributional shift. The recently proposed \textit{In-sample Learning} paradigm (i.e., IQL), which improves the policy by expectile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles distributional shift when learning the value function. In this work, we find that the in-sample learning paradigm arises under the \textit{Implicit Value Regularization} (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works: it applies implicit value regularization to the policy. Based on the IVR framework, we further propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works, but in a fully in-sample manner. Compared with IQL, our algorithms introduce sparsity in learning the value function, which makes them more robust in noisy data regimes. We also verify the effectiveness of SQL and EQL on D4RL benchmark datasets and show the benefits of in-sample learning by comparing them with CQL in small data regimes.