In this paper, we introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations. MAAD uses a surrogate reward signal, which can be derived from various sources such as adversarial games, trajectory matching objectives, or optimal transport criteria. To compensate for the unavailability of expert actions, we rely on an inverse dynamics model that infers a plausible action distribution given the expert's state-to-state transitions; we then regularize the imitator's policy by aligning it with the inferred action distribution. MAAD leads to significantly improved sample efficiency and stability. We demonstrate its effectiveness in a number of MuJoCo environments, in both the OpenAI Gym and the DeepMind Control Suite. We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods. Remarkably, MAAD is often the only method that attains expert-level performance, underscoring its simplicity and efficacy.
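The regularization idea described above can be illustrated with a minimal sketch. The abstract does not say which divergence is used, how the distributions are parameterized, or how the terms are weighted, so everything concrete here is an assumption made for illustration: both the inverse dynamics model and the policy are taken to output diagonal Gaussians, the alignment term is a KL divergence, and `beta`, `diag_gaussian_kl`, and `regularized_loss` are hypothetical names rather than parts of MAAD itself.

```python
import math
from typing import Sequence, Tuple

# A diagonal Gaussian over actions, stored as (means, standard deviations).
Gaussian = Tuple[Sequence[float], Sequence[float]]


def diag_gaussian_kl(mu_p: Sequence[float], std_p: Sequence[float],
                     mu_q: Sequence[float], std_q: Sequence[float]) -> float:
    """KL(p || q) between two diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


def regularized_loss(surrogate_loss: float,
                     idm_dists: Sequence[Gaussian],
                     policy_dists: Sequence[Gaussian],
                     beta: float = 1.0) -> float:
    """Surrogate-reward policy loss plus an alignment penalty (assumed form).

    idm_dists[i]    : action distribution the inverse dynamics model infers
                      for the i-th expert transition (s_i, s_{i+1}).
    policy_dists[i] : the imitator policy's action distribution at s_i.
    beta            : hypothetical weight on the alignment term.
    """
    kls = [
        diag_gaussian_kl(idm_mu, idm_std, pi_mu, pi_std)
        for (idm_mu, idm_std), (pi_mu, pi_std) in zip(idm_dists, policy_dists)
    ]
    return surrogate_loss + beta * sum(kls) / len(kls)


# Toy usage: one expert transition with a two-dimensional action space.
idm = [([0.0, 0.0], [1.0, 1.0])]
policy = [([1.0, 1.0], [1.0, 1.0])]
loss = regularized_loss(surrogate_loss=2.0, idm_dists=idm, policy_dists=policy, beta=0.5)
```

In this sketch the penalty is zero when the policy already matches the inferred action distribution and grows as the two diverge, so the surrogate reward drives task progress while the alignment term pulls the policy toward actions consistent with the expert's observed transitions.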