Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching

In inverse reinforcement learning (IRL), an agent seeks to replicate expert demonstrations through interactions with the environment. Traditionally, IRL is treated as an adversarial game, where an adversary searches over reward models, and a learner optimizes the reward through repeated RL procedures. This game-solving approach is both computationally expensive and difficult to stabilize. In this work, we propose a novel approach to IRL by direct policy optimization: exploiting a linear factorization of the return as the inner product of successor features and a reward vector, we design an IRL algorithm by policy gradient descent on the gap between the learner and expert features. Our non-adversarial method does not require learning a reward function and can be solved seamlessly with existing actor-critic RL algorithms. Remarkably, our approach works in state-only settings without expert action labels, a setting which behavior cloning (BC) cannot solve. Empirical results demonstrate that our method learns from as few as a single expert demonstration and achieves improved performance on various control tasks.
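The linear factorization mentioned above can be illustrated on a small tabular problem. The following is a minimal sketch, not the authors' implementation: it assumes a randomly generated finite MDP, state-only features `Phi`, and a reward that is exactly linear in those features, `r = Phi @ w`. All names (`expected_successor_features`, `policy_return`, `Phi`, `w`, `mu0`) are hypothetical and chosen for illustration. It checks numerically that the return equals the inner product of the expected successor features and the reward vector, and that the return difference between two policies is bounded by the norm of their feature gap, which is the quantity a feature-matching learner would drive toward zero.

```python
import numpy as np

def expected_successor_features(P, pi, Phi, mu0, gamma):
    """Expected discounted feature sum under policy pi, averaged over mu0.

    P: (S, A, S) transitions, pi: (S, A) policy, Phi: (S, d) state features.
    """
    # State-to-state transition matrix induced by the policy.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # psi[s] = Phi[s] + gamma * sum_t P_pi[s, t] * psi[t], solved in closed form.
    psi = np.linalg.solve(np.eye(len(mu0)) - gamma * P_pi, Phi)
    return mu0 @ psi

def policy_return(P, pi, r, mu0, gamma):
    """Expected discounted return of pi for a state reward vector r."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    v = np.linalg.solve(np.eye(len(mu0)) - gamma * P_pi, r)
    return float(mu0 @ v)

# Toy problem (all sizes and values are illustrative assumptions).
rng = np.random.default_rng(0)
S, A, d, gamma = 5, 3, 4, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
learner = rng.random((S, A))
learner /= learner.sum(axis=-1, keepdims=True)
expert = rng.random((S, A))
expert /= expert.sum(axis=-1, keepdims=True)
Phi = rng.normal(size=(S, d))
w = rng.normal(size=d)        # reward vector, unknown to the learner in IRL
mu0 = np.full(S, 1.0 / S)     # uniform initial state distribution

psi_learner = expected_successor_features(P, learner, Phi, mu0, gamma)
psi_expert = expected_successor_features(P, expert, Phi, mu0, gamma)

# Return factorizes as <successor features, reward vector>.
J_learner = policy_return(P, learner, Phi @ w, mu0, gamma)
J_expert = policy_return(P, expert, Phi @ w, mu0, gamma)
factorization_holds = np.isclose(J_learner, psi_learner @ w)

# Cauchy-Schwarz: the return gap is bounded by ||w|| times the feature gap.
gap = psi_expert - psi_learner
bound_holds = abs(J_expert - J_learner) <= np.linalg.norm(w) * np.linalg.norm(gap) + 1e-9
print(factorization_holds, bound_holds)
```

Because the bound holds for any reward vector `w`, shrinking the feature gap shrinks the return gap without ever estimating `w`. The sketch also uses only state features, so nothing in it depends on expert action labels.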
