Imitation learning methods infer a policy in a Markov decision process from a dataset of expert demonstrations by minimizing a divergence between the empirical state occupancy measures of the expert and the policy. The guiding signal for the policy is provided by a discriminator trained as part of an adversarial optimization procedure. We observe that this model is prone to absorbing spurious correlations present in the expert data. To alleviate this issue, we propose to use causal invariance as a regularization principle for the adversarial training of these models. The regularization objective can be applied in a straightforward manner to existing adversarial imitation frameworks. We demonstrate the efficacy of the regularized formulation in an illustrative two-dimensional setting as well as on a number of high-dimensional robot locomotion benchmark tasks.
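As a rough illustration of how an invariance regularizer could be attached to an adversarial imitation discriminator, the sketch below adds a gradient penalty in the style of IRMv1 to a standard binary cross-entropy discriminator loss. The abstract does not say which invariance penalty is used or how the data is split into environments, so both are assumptions here, and all names (`bce_with_logits`, `irm_penalty`, `regularized_discriminator_loss`, `lam`) are hypothetical. Labels are assumed to be 1 for expert samples and 0 for policy samples.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_with_logits(logits, labels):
    # Numerically stable binary cross-entropy computed on raw logits.
    return np.mean(
        np.maximum(logits, 0.0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    )


def irm_penalty(logits, labels):
    # Assumed IRMv1-style penalty: squared derivative of the risk with respect
    # to a dummy scalar w that multiplies the logits, evaluated at w = 1.
    # For cross-entropy this derivative is mean((sigmoid(z) - y) * z).
    grad = np.mean((sigmoid(logits) - labels) * logits)
    return grad ** 2


def regularized_discriminator_loss(envs, lam=1.0):
    # envs: list of (logits, labels) pairs, one pair per assumed environment.
    # lam: hypothetical weight trading off fit against invariance.
    risks = [bce_with_logits(z, y) for z, y in envs]
    penalties = [irm_penalty(z, y) for z, y in envs]
    return np.mean(risks) + lam * np.mean(penalties)
```

With `lam=0` this reduces to the average discriminator risk over environments. Larger values push the discriminator toward features for which the same classifier is simultaneously optimal in every environment, which is one way to discourage reliance on spurious correlations.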