In the context of inverse reinforcement learning (IRL) with a single expert, adversarial inverse reinforcement learning (AIRL) is a foundational approach for recovering comprehensive and transferable task descriptions. However, AIRL faces practical performance challenges, stemming primarily from the framework's overly idealized decomposability condition, the unclear proof of the equilibrium reached in reward recovery, and its questionable robustness in high-dimensional environments. This paper revisits AIRL in \textbf{high-dimensional scenarios where the state space tends to infinity}. Specifically, we first establish a necessary and sufficient condition for reward transferability by examining the rank of the matrix obtained by subtracting the identity matrix from the transition matrix. Furthermore, leveraging random matrix theory, we analyze the spectral distribution of this matrix and show that our rank criterion holds with high probability even when the transition matrices are unobservable. This suggests that the limitations on transfer are not inherent to the AIRL framework itself, but instead stem from the training variance of the reinforcement learning algorithms employed within it. Based on this insight, we propose a hybrid framework that integrates on-policy proximal policy optimization in the source environment with off-policy soft actor-critic in the target environment, yielding significant improvements in the effectiveness of reward transfer.
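To make the rank statement above concrete, the following is a minimal numerical sketch, not taken from the paper: it draws a random row-stochastic transition matrix $P$, forms $P - I$, and checks its rank. Because every row of $P$ sums to one, $(P - I)\mathbf{1} = 0$, so the rank is at most $n - 1$; for a generic random $P$ it equals $n - 1$, illustrating the kind of high-probability rank behavior the abstract refers to. The exact threshold used in the paper's necessary-and-sufficient condition is not reproduced here.

```python
# Minimal sketch (illustrative only, not the paper's criterion):
# check the rank of P - I for a randomly drawn row-stochastic P.
import numpy as np

rng = np.random.default_rng(0)
n = 200                                   # number of states (illustrative choice)
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)         # normalize rows -> stochastic matrix

M = P - np.eye(n)                         # matrix whose rank the criterion examines
print(np.linalg.matrix_rank(M), n - 1)    # typically prints: 199 199
```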