Adversarial inverse reinforcement learning (AIRL) is a cornerstone approach in imitation learning, yet it has drawn repeated criticism in prior studies. In this paper, we rethink AIRL and respond to three of these criticisms. Criticism 1 is inadequate policy imitation. We show that replacing the built-in policy-update algorithm with soft actor-critic (SAC) during policy updating (which requires multiple iterations) significantly improves the efficiency of policy imitation. Criticism 2 is limited performance in transferable reward recovery, even with SAC integrated. Although we find that SAC does significantly improve policy imitation, it also hinders transferable reward recovery. We prove that SAC by itself cannot comprehensively disentangle the reward function during AIRL training, and we propose a hybrid framework, PPO-AIRL + SAC, that achieves a satisfactory transfer effect. Criticism 3 is the unsatisfactory proof from the perspective of potential equilibrium. We reanalyze it from an algebraic theory perspective.