Adversarial inverse reinforcement learning (AIRL) is a cornerstone approach in imitation learning. This paper revisits AIRL from two angles: policy imitation and transferable reward recovery. We first replace AIRL's built-in policy optimization algorithm with soft actor-critic (SAC) to improve sample efficiency, exploiting SAC's off-policy formulation and Markov decision process (MDP) models that are identifiable with respect to AIRL. This substitution significantly improves policy imitation but unexpectedly degrades transferable reward recovery. To investigate this issue, we show that SAC by itself cannot comprehensively disentangle the reward function during AIRL training, and we propose a hybrid framework, PPO-AIRL + SAC, that achieves satisfactory transfer performance. Additionally, we analyze, from an algebraic theory perspective, the capability of environments to extract disentangled rewards.
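For readers unfamiliar with the reward structure being disentangled, the following is a minimal sketch of the standard AIRL discriminator from the original AIRL formulation, not code from this paper. It assumes a state-only reward term g(s), a shaping term h(s), and a policy log-probability log π(a|s); the function names and the scalar inputs are hypothetical and chosen only for illustration.

```python
import numpy as np

def airl_logit(g_s, h_s, h_next, gamma):
    """f(s, a, s') = g(s) + gamma * h(s') - h(s).

    g_s is the state-only reward term that AIRL aims to recover;
    h_s and h_next are shaping terms (assumed inputs here).
    """
    return g_s + gamma * h_next - h_s

def airl_discriminator(f, log_pi):
    """D = exp(f) / (exp(f) + pi(a|s)), written as sigmoid(f - log_pi)."""
    return 1.0 / (1.0 + np.exp(-(f - log_pi)))

def airl_policy_reward(f, log_pi):
    """Reward passed to the policy optimizer: log D - log(1 - D)."""
    d = airl_discriminator(f, log_pi)
    return np.log(d) - np.log1p(-d)

# Illustrative values only (not taken from the paper).
f = airl_logit(g_s=1.0, h_s=0.5, h_next=0.2, gamma=0.9)
log_pi = -1.2
reward = airl_policy_reward(f, log_pi)
```

Algebraically, log D - log(1 - D) reduces to f - log π(a|s), so the policy optimizer (PPO, SAC, or any other) sees an entropy-regularized reward, while g(s) is the part that is meant to transfer across environments.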