Implicit Q-learning (IQL) is a strong baseline for offline RL that learns the value function from dataset actions alone via expectile regression. However, it remains unclear how to recover the implicit policy from the learned implicit Q-function, and why IQL can use weighted regression for policy extraction. IDQL reinterprets IQL as an actor-critic method and derives the weights of the implicit policy, but these weights hold only for the optimal value function. In this work, we take a different approach to the implicit policy-finding (IPF) problem by formulating it as an optimization problem. Based on this formulation, we propose two practical algorithms, AlignIQL and AlignIQL-hard, which inherit IQL's advantage of decoupling the actor from the critic and provide insight into why IQL can use weighted regression for policy extraction. Compared with IQL and IDQL, our method retains the simplicity of IQL while solving the implicit policy-finding problem. Experiments on D4RL datasets show that our method achieves competitive or superior results compared with other state-of-the-art offline RL methods. In complex sparse-reward tasks such as Antmaze and Adroit in particular, it outperforms IQL and IDQL by a significant margin.
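For readers unfamiliar with the two IQL ingredients named above, the following is a minimal NumPy sketch of (i) the asymmetric expectile loss used to fit the value function on dataset actions and (ii) the exponentiated-advantage weights used in weighted regression for policy extraction. It illustrates standard IQL only, not AlignIQL or AlignIQL-hard. The function names, the defaults `tau=0.7` and `beta=3.0`, and the clipping constant `max_weight=100.0` are illustrative assumptions, not values taken from this abstract.

```python
import numpy as np


def expectile_loss(q, v, tau=0.7):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2 with u = q - v.

    tau > 0.5 penalizes v underestimating q more than overestimating it,
    pushing v toward an upper expectile of q over dataset actions.
    (tau=0.7 is an assumed default for illustration.)
    """
    u = np.asarray(q, dtype=float) - np.asarray(v, dtype=float)
    weight = np.where(u < 0, 1.0 - tau, tau)
    return float(np.mean(weight * u ** 2))


def awr_weights(q, v, beta=3.0, max_weight=100.0):
    """Exponentiated-advantage weights exp(beta * (q - v)), clipped for stability.

    beta and max_weight are assumed hyperparameters for illustration.
    """
    adv = np.asarray(q, dtype=float) - np.asarray(v, dtype=float)
    return np.minimum(np.exp(beta * adv), max_weight)


def weighted_regression_loss(log_prob, q, v, beta=3.0):
    """Negative advantage-weighted log-likelihood of the dataset actions."""
    w = awr_weights(q, v, beta=beta)
    return float(-np.mean(w * np.asarray(log_prob, dtype=float)))
```

With `tau = 0.5` the expectile loss reduces to half the mean squared error; raising `tau` makes the fitted value track the better actions in the dataset, which is what lets the critic be trained without querying actions outside the data.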