In this study, we investigate DIstribution Correction Estimation (DICE) methods, an important line of work in offline reinforcement learning (RL) and imitation learning (IL). DICE-based methods impose a state-action-level behavior constraint, which is an ideal choice for offline learning. However, they typically perform much worse than current state-of-the-art (SOTA) methods that use only an action-level behavior constraint. After revisiting DICE-based methods, we find that learning the value function with a true-gradient update involves two gradient terms: a forward gradient (taken on the current state) and a backward gradient (taken on the next state). Using only the forward gradient closely resembles many offline RL methods and can thus be regarded as applying an action-level constraint. However, directly adding the backward gradient may weaken or cancel out the forward gradient's effect when the two gradients point in conflicting directions. To resolve this issue, we propose a simple yet effective modification that projects the backward gradient onto the normal plane of the forward gradient, resulting in an orthogonal-gradient update, a new learning rule for DICE-based methods. We conduct thorough theoretical analyses and find that the projected backward gradient brings state-level behavior regularization, which resolves the mystery of DICE-based methods: the value learning objective does try to impose a state-action-level constraint, but it needs to be used in a corrected way. Through toy examples and extensive experiments on complex offline RL and IL tasks, we demonstrate that DICE-based methods using orthogonal-gradient updates (O-DICE) achieve SOTA performance and strong robustness.
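The projection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: gradients are represented as plain Python lists, the function names `project_orthogonal` and `orthogonal_gradient` are hypothetical, and the weight `eta` on the projected backward gradient is an assumption introduced here for illustration.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_orthogonal(
    backward: Sequence[float], forward: Sequence[float], eps: float = 1e-12
) -> List[float]:
    """Remove from `backward` its component along `forward`.

    The result lies in the normal plane of `forward`, so it can no longer
    oppose or cancel the forward gradient.
    """
    denom = dot(forward, forward)
    if denom < eps:
        # Forward gradient is (numerically) zero: nothing to project against.
        return list(backward)
    coef = dot(backward, forward) / denom
    return [b - coef * f for b, f in zip(backward, forward)]


def orthogonal_gradient(
    forward: Sequence[float], backward: Sequence[float], eta: float = 1.0
) -> List[float]:
    """Forward gradient plus the projected backward gradient.

    `eta` is a hypothetical weighting factor assumed for this sketch.
    """
    projected = project_orthogonal(backward, forward)
    return [f + eta * p for f, p in zip(forward, projected)]


# Toy case with conflicting directions: naive addition gives [-1.0, 1.0],
# which reverses the forward component.
forward_grad = [1.0, 0.0]
backward_grad = [-2.0, 1.0]
naive = [f + b for f, b in zip(forward_grad, backward_grad)]
combined = orthogonal_gradient(forward_grad, backward_grad)
```

In this toy case the naive sum flips the sign of the forward component, whereas the combined update keeps the forward component intact and adds only the part of the backward gradient that is orthogonal to it.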