Adversarial inverse reinforcement learning (AIRL) is a cornerstone approach in imitation learning, yet prior studies have raised several criticisms of it. In this paper, we rethink AIRL and respond to these criticisms. Criticism 1 concerns inadequate policy imitation. We show that substituting the built-in algorithm with soft actor-critic (SAC) during policy updating (which requires multiple iterations) significantly improves the efficiency of policy imitation. Criticism 2 concerns limited performance in transferable reward recovery despite SAC integration. Although SAC indeed yields a significant improvement in policy imitation, it is detrimental to transferable reward recovery. We prove that SAC alone cannot comprehensively disentangle the reward function during AIRL training, and we propose a hybrid framework, PPO-AIRL + SAC, that achieves a satisfactory transfer effect. Criticism 3 concerns an unsatisfactory proof from the perspective of potential equilibrium. We reanalyze it from the perspective of algebraic theory.
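To make the hybrid recipe concrete, the following is a minimal structural sketch, not the authors' implementation: stage 1 runs AIRL with a PPO learner and keeps only the recovered reward; stage 2 trains SAC from scratch on that frozen reward. Every name here (AIRLDiscriminator, ppo_update, sac_update, the toy 1-D rollout) is a hypothetical placeholder, and the placeholder arithmetic only marks where the real discriminator, PPO, and SAC updates would sit.

```python
# Hypothetical sketch of the PPO-AIRL + SAC pipeline (illustrative only).
import random
from dataclasses import dataclass


@dataclass
class Transition:
    state: float
    action: float
    next_state: float


class AIRLDiscriminator:
    """Stand-in for AIRL's structured discriminator, whose learned
    f(s, a, s') = g(s, a) + gamma * h(s') - h(s) serves as the
    recovered reward after training."""

    def __init__(self) -> None:
        self.theta = 0.0  # placeholder parameter

    def reward(self, t: Transition) -> float:
        return self.theta * t.state  # linear stand-in for f(s, a, s')

    def update(self, expert_batch, policy_batch) -> None:
        # Placeholder step: push expert transitions above policy ones.
        gap = (sum(t.state for t in expert_batch) / len(expert_batch)
               - sum(t.state for t in policy_batch) / len(policy_batch))
        self.theta += 0.01 * gap


def rollout(policy_param: float, n: int):
    """Toy 1-D rollout standing in for environment interaction."""
    batch = []
    for _ in range(n):
        s = random.gauss(policy_param, 1.0)
        a = random.gauss(0.0, 1.0)
        batch.append(Transition(state=s, action=a, next_state=s + a))
    return batch


def ppo_update(policy_param: float, batch, disc: AIRLDiscriminator) -> float:
    # Placeholder for an on-policy PPO step on the AIRL surrogate reward.
    grad = sum(disc.reward(t) for t in batch) / len(batch)
    return policy_param + 0.05 * grad


def sac_update(policy_param: float, replay, disc: AIRLDiscriminator) -> float:
    # Placeholder for an off-policy SAC step; unlike PPO, it samples a
    # replay buffer rather than fresh on-policy rollouts.
    batch = random.sample(replay, min(32, len(replay)))
    grad = sum(disc.reward(t) for t in batch) / len(batch)
    return policy_param + 0.05 * grad


def train_hybrid(expert_batch, iters: int = 50) -> float:
    # Stage 1: PPO-AIRL, kept only for the reward it recovers.
    disc, ppo_param = AIRLDiscriminator(), 0.0
    for _ in range(iters):
        batch = rollout(ppo_param, 32)
        disc.update(expert_batch, batch)
        ppo_param = ppo_update(ppo_param, batch, disc)

    # Stage 2: SAC trained from scratch on the frozen recovered reward,
    # e.g. in a dynamics-shifted target environment.
    sac_param, replay = 0.0, []
    for _ in range(iters):
        replay += rollout(sac_param, 8)
        sac_param = sac_update(sac_param, replay, disc)
    return sac_param


if __name__ == "__main__":
    expert = rollout(policy_param=2.0, n=256)  # pretend expert data
    print("final SAC policy parameter:", train_hybrid(expert))
```

The design point the sketch encodes is the division of labor: PPO's on-policy updates keep the discriminator's training signal well matched to the current policy distribution (which matters for disentangling a transferable reward), while SAC's off-policy efficiency is reserved for optimizing the already-recovered reward.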