Effective offline RL methods require properly handling out-of-distribution actions. Implicit Q-learning (IQL) addresses this by training a Q-function using only dataset actions through a modified Bellman backup. However, it is unclear which policy actually attains the values represented by this implicitly trained Q-function. In this paper, we reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization against divergence from the behavior policy, with the specific choice of loss determining the nature of this tradeoff. Notably, this actor can be complex and multimodal, which suggests a mismatch with the conditional Gaussian actor fit by advantage-weighted regression (AWR) in prior methods. Instead, we propose drawing samples from a diffusion-parameterized behavior policy and using weights computed from the critic to importance sample our intended policy. We introduce Implicit Diffusion Q-learning (IDQL), which combines our general IQL critic with this policy extraction method. IDQL maintains the ease of implementation of IQL while outperforming prior offline RL methods and demonstrating robustness to hyperparameters. Code is available at https://github.com/philippe-eecs/IDQL.
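The policy extraction step described in the abstract can be illustrated with a minimal sketch. It is not taken from the released code: the function names, the default `tau=0.7`, and the toy inputs are hypothetical, and the weight formula assumes the expectile loss of IQL, for which the weight |f'(u)| / |u| with u = Q(s, a) - V(s) reduces to a constant that depends only on the sign of u. A plain Python list stands in for the actions that would be sampled from the diffusion behavior policy.

```python
import random


def expectile_loss(q, v, tau=0.7):
    """Asymmetric squared loss that fits V(s) toward an upper expectile of Q(s, a)."""
    u = q - v
    scale = tau if u > 0 else 1.0 - tau
    return scale * u * u


def expectile_weight(q, v, tau=0.7):
    """Assumed unnormalized actor weight for the expectile loss.

    Derived here as |f'(u)| / |u| with the constant factor 2 dropped;
    this formula is an assumption of the sketch, not stated in the abstract.
    """
    return tau if q - v > 0 else 1.0 - tau


def resample_action(candidates, q_values, v, tau=0.7, rng=random):
    """Pick one candidate action with probability proportional to its weight.

    `candidates` stands in for actions sampled from the behavior policy;
    `q_values` are the critic's Q(s, a) for those same actions.
    """
    weights = [expectile_weight(q, v, tau) for q in q_values]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    actions = ["left", "stay", "right"]   # hypothetical behavior samples
    q_values = [0.2, 1.5, 0.9]            # hypothetical critic outputs
    print(resample_action(actions, q_values, v=1.0, tau=0.7, rng=rng))
```

With `tau` close to 1, candidates whose Q-value exceeds V(s) dominate the resampling, while `tau = 0.5` weights all candidates equally and so returns a plain behavior sample.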