Adversarial inverse reinforcement learning (AIRL) stands as a cornerstone approach in imitation learning, yet it has drawn several criticisms from prior studies. In this paper, we rethink AIRL and respond to these criticisms. Criticism 1: inadequate policy imitation. We show that substituting the built-in algorithm with soft actor-critic (SAC) during policy updating (a step that requires multiple iterations) significantly improves the efficiency of policy imitation. Criticism 2: limited performance in transferable reward recovery despite SAC integration. Although SAC indeed yields a significant improvement in policy imitation, it introduces drawbacks for transferable reward recovery. We prove that the SAC algorithm itself cannot fully disentangle the reward function during AIRL training, and we propose a hybrid framework, PPO-AIRL + SAC, that achieves satisfactory transfer performance. Criticism 3: an unsatisfactory proof from the perspective of potential equilibrium. We reanalyze the proof from an algebraic-theory perspective.
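For orientation, the sketch below shows the disentangled discriminator structure that "transferable reward recovery" refers to in the original AIRL formulation (Fu et al., 2018), where f(s, a, s') = g(s) + γh(s') − h(s) is compared against log π(a|s). It is a minimal illustrative sketch only: the class name, network sizes, and γ are assumptions, not values or code from this paper.

```python
import torch
import torch.nn as nn


class AIRLDiscriminator(nn.Module):
    """Disentangled AIRL discriminator: f(s, a, s') = g(s) + gamma * h(s') - h(s).

    g approximates a state-only, transferable reward; h is a shaping potential.
    Hidden size and gamma are illustrative assumptions, not taken from the paper.
    """

    def __init__(self, state_dim: int, hidden: int = 64, gamma: float = 0.99):
        super().__init__()
        self.gamma = gamma
        # State-only reward estimator g(s).
        self.g = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Shaping potential h(s).
        self.h = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def f(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        # f(s, a, s') = g(s) + gamma * h(s') - h(s)
        return self.g(s) + self.gamma * self.h(s_next) - self.h(s)

    def forward(self, s: torch.Tensor, s_next: torch.Tensor, log_pi: torch.Tensor) -> torch.Tensor:
        # D = exp(f) / (exp(f) + pi); returned as a logit, so f - log pi(a|s),
        # suitable for torch.nn.BCEWithLogitsLoss (expert label 1, policy label 0).
        return self.f(s, s_next) - log_pi.unsqueeze(-1)


if __name__ == "__main__":
    disc = AIRLDiscriminator(state_dim=4)
    s, s_next, log_pi = torch.randn(8, 4), torch.randn(8, 4), torch.randn(8)
    logits = disc(s, s_next, log_pi)  # shape (8, 1)
    reward = disc.g(s)                # state-only reward candidate for transfer
```

As we read the abstract, the proposed hybrid scheme trains such a discriminator against a PPO policy (PPO-AIRL) so that the recovered state-only reward g remains disentangled, and only then uses SAC for policy learning in the transfer environment; the exact training procedure is specified in the paper itself, not here.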