Recent methods for imitation learning directly learn a $Q$-function using an implicit reward formulation rather than an explicit reward function. However, these methods generally require implicit reward regularization to improve stability, and they often mistreat absorbing states. Previous work shows that squared-norm regularization of the implicit reward function is effective, but it does not provide a theoretical analysis of the resulting properties of the algorithms. In this work, we show that using this regularizer under a mixture distribution of the policy and the expert provides a particularly illuminating perspective: the original objective can be understood as squared Bellman error minimization, and the corresponding optimization problem minimizes a bounded $\chi^2$-divergence between the expert and the mixture distribution. This perspective allows us to address instabilities and properly treat absorbing states. We show that our method, Least Squares Inverse Q-Learning (LS-IQ), outperforms state-of-the-art algorithms, particularly in environments with absorbing states. Finally, we propose to use an inverse dynamics model to learn from observations only. Using this approach, we retain performance in settings where no expert actions are available.
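To make the regularizer concrete, the following is a minimal sketch rather than the reference implementation. It assumes the standard inverse Bellman form of the implicit reward, $r_Q(s,a) = Q(s,a) - \gamma V(s')$, and an equal-weight mixture of expert and policy samples. The function names, the coefficient `c`, and the mixture weights are illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def implicit_reward(q_sa, v_next, gamma=0.99):
    """Implicit reward r_Q(s, a) = Q(s, a) - gamma * V(s').

    Assumption: sampled next-state values stand in for the expectation
    over s'. Absorbing states are not handled in this sketch.
    """
    q_sa = np.asarray(q_sa, dtype=float)
    v_next = np.asarray(v_next, dtype=float)
    return q_sa - gamma * v_next


def mixture_reward_regularizer(r_expert, r_policy, c=0.5):
    """Squared-norm penalty c * E_mix[r_Q^2].

    Assumption: the mixture weights expert and policy samples equally,
    so the expectation is the average of the two batch means.
    """
    r_expert = np.asarray(r_expert, dtype=float)
    r_policy = np.asarray(r_policy, dtype=float)
    mix_mean_sq = 0.5 * np.mean(r_expert ** 2) + 0.5 * np.mean(r_policy ** 2)
    return c * mix_mean_sq
```

In a training loop, a penalty of this kind would be added to the loss that fits the $Q$-function, with `r_expert` and `r_policy` computed by `implicit_reward` on expert and policy batches respectively.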